Key Factors Driving Parents to Immunize Their Children in India: A Preliminary Research

by Mochammad Miftahul Fahmi (5972035)

Hello, everyone! I hope you're having a beautiful day.

My name is Miftahul Fahmi, and I am currently studying Management of Technology at Delft University of Technology. Here are some assignments from my data science courses, which I took from November 2023 to April 2024, so there will be more code posted here over time.

Seriously, why do I post my code here? The answer is simple:

  • I love programming, and data science has sparked a new spirit in me. Putting my code here makes it easier to review my old code for future needs, anytime and anywhere (instead of checking my old folders). I also believe my way of coding (learned in class labs, with some help from ChatGPT; no blind copy-pasting) and my explanations will help you.
  • I only post assignments that received high grades (meaning the supervisors could vouch for their quality).

Please, enjoy!

WHAT WILL WE DO HERE?

This notebook focuses on Geographical Visualization and Analysis.

The steps:

  1. Data cleaning and exploration (Parts 01-03 below; please check the table of contents): using geopandas and numpy. Visualization: OSMNx (which can bring amenities onto a geographic map), Seaborn (for visualizing maps and areas), and Matplotlib (one of the easiest ways to create graphs).
  2. Creating a choropleth (a map that uses differences in shading, colouring, or the placement of symbols within predefined areas to indicate the average values of a particular quantity in those areas). This is in Part 04.
  3. Adding amenities to 3 case-study districts in India: Pathanamthitta, Anjaw, and Bandara. This is in Part 04.
  4. Exploratory Spatial Data Analysis: building a spatial weight matrix, then global and local analysis (using Spatial Autocorrelation / LISA). This is in Part 05.
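To preview step 4: a spatial weight matrix encodes which areas count as neighbours. Below is a minimal, self-contained sketch of Queen contiguity on a toy grid, written in plain numpy for illustration (the notebook itself uses libpysal's Queen weights later); cells touching by an edge or a corner are treated as neighbours.

```python
import numpy as np

def queen_weights(nrows, ncols):
    """Binary Queen contiguity matrix for an nrows x ncols grid.

    Cells sharing an edge OR a corner are neighbours (weight 1)."""
    n = nrows * ncols
    W = np.zeros((n, n), dtype=int)
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == dc == 0:
                        continue  # a cell is not its own neighbour
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < nrows and 0 <= cc < ncols:
                        W[i, rr * ncols + cc] = 1
    return W

W = queen_weights(3, 3)
print(W[4].sum())  # the centre cell of a 3x3 grid has 8 Queen neighbours
```

The "many islands" issue mentioned in Part 01 corresponds to rows of such a matrix that are all zero: areas with no neighbours at all.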

The program is structured as follows:

  • Part 01: Download Data and Initial Data Checking
  • Part 02: Research Formulation
  • Part 03: Data Cleaning and Preparation
  • Part 04: Spatial Visualization
  • Part 05: Exploratory Spatial Data Analysis
  • Part 06: Conclusions and Reflection

Part 01: Initial Data Check

In this part, the main activities are:

  1. Declare the libraries. I will mainly use pandas for data cleaning and reshaping, Matplotlib for plot visualization, and Seaborn for spatial visualization.
  2. Read the CSV file using pandas read_csv. The data is located in the ./data folder.
  3. Glimpse the data using head.
  4. Check the number of rows and columns using "shape".
  5. Check the unique columns (variables) and rows (index) using "unique".
In [1]:
# Declaration of Used Library

# For dataframe processing (shp and csv)
import pandas as pd
import geopandas as gpd
import numpy as np
from scipy.stats import pearsonr
from libpysal.weights import Queen

# For mapping geographical features
import osmnx as ox
import seaborn as sns
import contextily as ctx
from pysal.lib import weights
from pysal.explore import esda
from splot.esda import moran_scatterplot, lisa_cluster, plot_local_autocorrelation

# For plotting
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
from shapely.geometry import Point
%matplotlib inline
import mapclassify


Read the CSV file using read_csv. The CSV was downloaded from the open-source Mission Antyodaya Village Facilities in SHRUG (https://www.devdatalab.org/shrug_download/), where further information can also be found.

The analysis will focus on the "Subdistrict" and "District" levels for several reasons (explained in the Reflection at the bottom of Part 01).

In [2]:
# Read CSV file, at desired level
initial_data = pd.read_csv("./data/shrug-antyodaya-csv/antyodaya_pc11subdist.csv")
# The content of the CSV will be explained on Part 02

# Check the data at a glimpse using head
initial_data.head()
Out[2]:
pc11_state_id pc11_district_id pc11_subdistrict_id total_population male_population female_population total_hhd total_hhd_engaged_in_farm_activi total_hhd_engaged_in_non_farm_ac is_govt_seed_centre_available ... open_uncovered_drainage open_kuchha_drainage fpo no_electricity internal_pucca_road livestock_ext_services _mean_p_miss _core_p_miss _target_weight_share _target_group_max_weight_share
0 3 41 225 778.0 407.0 371.0 155.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 1.0 1.0
1 12 245 1563 446.0 216.0 230.0 95.0 40.0 45.0 0.0 ... 0.0 1.0 0.0 0.00000 0.00000 0.0 0.0 0.0 1.0 1.0
2 12 250 1626 603.0 253.0 350.0 129.0 110.0 3.0 0.0 ... 0.0 0.0 0.0 0.45937 0.54063 0.0 0.0 0.0 1.0 1.0
3 12 253 1679 1380.0 828.0 552.0 228.0 120.0 70.0 0.0 ... 0.0 1.0 0.0 0.00000 0.00000 1.0 0.0 0.0 1.0 1.0
4 21 371 2774 297.0 155.0 142.0 70.0 55.0 15.0 0.0 ... 0.0 0.0 0.0 0.00000 1.00000 0.0 0.0 0.0 1.0 1.0

5 rows × 171 columns

Next, check the number of rows and columns, as well as the unique values of each.

In [3]:
# Review the number of records (row) and columns
print("Check the Number of Rows and Columns")
num_rows, num_columns = initial_data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")
Check the Number of Rows and Columns
Number of rows: 5480
Number of columns: 171
In [4]:
# Get the name of the column (column title)
column_names = list(initial_data.columns)
print("The name of the columns of the data:" )

for i, column_name in enumerate(column_names, start=1):
    print(f"{i}. {column_name}")
The name of the columns of the data:
1. pc11_state_id
2. pc11_district_id
3. pc11_subdistrict_id
4. total_population
5. male_population
6. female_population
7. total_hhd
8. total_hhd_engaged_in_farm_activi
9. total_hhd_engaged_in_non_farm_ac
10. is_govt_seed_centre_available
11. availability_of_watershed_dev_pr
12. availability_of_rain_harvest_sys
13. availability_of_food_storage_war
14. availability_of_farm_gate_proces
15. availability_of_custom_hiring_ce
16. total_cultivable_area_in_hac
17. net_sown_area_in_hac
18. net_sown_area_kharif_in_hac
19. net_sown_area_rabi_in_hac
20. net_sown_area_other_in_hac
21. is_soil_testing_centre_available
22. is_fertilizer_shop_available
23. no_of_farmers_using_drip_sprinkl
24. area_irrigated_in_hac
25. total_unirrigated_land_area_in_h
26. availability_of_milk_routes
27. availability_of_poultry_dev_proj
28. availability_of_goatary_dev_proj
29. availability_of_pigery_developme
30. is_veterinary_hospital_available
31. availability_of_fish_community_p
32. availability_of_fish_farming
33. availability_of_aquaculture_ext_
34. total_hhd_with_kuccha_wall_kucch
35. total_hhd_have_got_pmay_house
36. total_hhd_in_pmay_permanent_wait
37. total_hhd_got_benefit_under_stat
38. total_hhd_in_perm_waitlist_under
39. piped_water_fully_covered
40. is_village_connected_to_all_weat
41. public_transport
42. availability_of_railway_station
43. total_hhd_having_pmsbhgy_benefit
44. availability_of_elect_supply_to_
45. availability_of_solor_wind_energ
46. total_hhd_having_solor_wind_ener
47. availability_of_panchayat_bhawan
48. csc
49. public_info_board
50. is_common_pastures_available
51. total_hhd_availing_pmuy_benefits
52. availability_of_public_library
53. recreational_centre
54. is_bank_available
55. is_bank_buss_correspondent_with_
56. is_atm_available
57. total_hhd_availing_pmjdy_bank_ac
58. is_post_office_available
59. telephone_services
60. is_broadband_available
61. is_pds_available
62. total_hhd_having_bpl_cards
63. availability_of_primary_school
64. is_primary_school_with_electrici
65. is_primary_school_with_computer_
66. is_primary_school_with_playgroun
67. is_primary_school_have_drinking_
68. availability_of_mid_day_meal_sch
69. total_primary_school_students
70. total_primary_school_teachers
71. availability_of_middle_school
72. availability_of_high_school
73. availability_of_ssc_school
74. no_of_children_not_attending_sch
75. availability_of_govt_degree_coll
76. total_grd_and_pg_in_village
77. is_vocational_edu_centre_availab
78. total_trained_trainee_under_skil
79. availability_of_jan_aushadhi_ken
80. total_hhd_registered_under_pmjay
81. is_community_waste_disposal_syst
82. total_hhd_with_clean_energy
83. is_community_biogas_waste_recycl
84. is_aanganwadi_centre_available
85. is_early_childhood_edu_provided_
86. total_childs_aged_0_to_3_years
87. total_childs_aged_0_to_3_years_r
88. total_childs_aged_3_to_6_years_r
89. total_childs_aged_0_to_3_years_i
90. total_childs_categorized_non_stu
91. total_anemic_pregnant_women
92. total_anemic_adolescent_girls
93. total_underweight_child_age_unde
94. total_male_child_age_bw_0_6
95. total_female_child_age_bw_0_6
96. total_minority_children_getting_
97. total_minority_hh_provided_bank_
98. no_of_physically_challenged_recv
99. total_hhd_with_more_than_2_child
100. availability_of_mother_child_hea
101. total_hhd_availing_pension_under
102. total_shg
103. total_hhd_mobilized_into_shg
104. total_no_of_shg_promoted
105. total_hhd_mobilized_into_pg
106. total_shg_accessed_bank_loans
107. is_bee_farming
108. is_sericulture
109. is_handloom
110. is_handicrafts
111. availability_of_community_forest
112. availability_of_minor_forest_pro
113. total_hhd_source_of_minor_forest
114. availability_of_cottage_small_sc
115. total_hhd_engaged_cottage_small_
116. availability_of_adult_edu_centre
117. total_no_of_registered_children_
118. total_no_of_children_0_to_6_year
119. total_no_of_pregnant_women
120. total_no_of_pregnant_women_recei
121. total_no_of_lactating_mothers
122. total_no_of_lactating_mothers_re
123. total_no_of_women_delivered_babi
124. total_no_of_children_in_icds_cas
125. total_no_of_young_anemic_childre
126. total_no_of_newly_born_children
127. total_no_of_newly_born_underweig
128. total_hhd_not_having_sanitary_la
129. total_no_of_eligible_beneficiari
130. total_no_of_beneficiaries_receiv
131. gp_total_no_of_eligible_benefici
132. gp_total_no_of_beneficiaries_rec
133. gp_total_hhd_eligible_under_nfsa
134. gp_total_hhd_receiving_food_grai
135. total_no_of_farmers_registered_u
136. total_no_of_farmers_subscribed_a
137. total_no_of_farmers
138. total_no_of_farmers_received_ben
139. total_no_farmers_adopted_organic
140. total_no_of_farmers_add_fert_in_
141. total_no_of_elected_representati
142. total_no_of_elect_rep_oriented_u
143. total_no_of_elect_rep_undergone_
144. total_approved_labour_budget_for
145. total_expenditure_approved_under
146. total_area_covered_under_irrigat
147. total_hhd_having_piped_water_con
148. pc11_land_area
149. canal
150. surface_water
151. ground_water
152. other_irrigation
153. any_primary_sch_toilet
154. mandi
155. regular_market
156. weekly_haat
157. phc
158. chc
159. sub_centre
160. closed_drainage
161. open_covered_drainage
162. open_uncovered_drainage
163. open_kuchha_drainage
164. fpo
165. no_electricity
166. internal_pucca_road
167. livestock_ext_services
168. _mean_p_miss
169. _core_p_miss
170. _target_weight_share
171. _target_group_max_weight_share

Reflection

Some important reflections on the decision of which dataset will be the main one:

A. The main dataset chosen

The main dataset will be SHRUG at the subdistrict level, for the following reasons: 1) Regarding one of the goals of the research: to run a preliminary study and case study to probe the initial hypotheses. The research also involves amenities (health-related facilities) that are inherently more granular (closer to the subdistrict level than state-wide). This is explained further in Part 02 below.

2) Regarding the data analysis process:

  • if the main data is at the district level, it is difficult to zoom in to the more granular subdistrict level;
  • if the main data is at the subdistrict level, it is easy to zoom out to a bird's-eye district-level view (with the help of a pivot).
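The "zoom out with a pivot" idea can be sketched as follows. The frame below is synthetic (made-up values), but it reuses the real SHRUG column names:

```python
import pandas as pd

# Synthetic stand-in for the subdistrict-level frame (values are invented)
sub = pd.DataFrame({
    "pc11_district_id":    [41, 41, 245],
    "pc11_subdistrict_id": [225, 226, 1563],
    "total_population":    [778.0, 650.0, 446.0],
})

# Zooming out: aggregate subdistrict rows up to the district level
district = (sub.groupby("pc11_district_id", as_index=False)
               .agg(total_population=("total_population", "sum"),
                    n_subdistricts=("pc11_subdistrict_id", "nunique")))
print(district)
```

Aggregating the other way (district to subdistrict) is impossible, which is why the finer level is kept as the main data.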

Hence, I will use the "Subdistrict"-level CSV as the initial main data. For the shapefile, however, I will also use the "District"-level SHP, because the "Subdistrict" shapefile resulted in 'many islands' while creating the Spatial Weight Matrix (this is part of the non-linear process).

The shrid-level CSV was not chosen because of data availability: as you can see in the SHRUG documentation, the shrid level has 11% missing data, while the subdistrict and district levels have only 7% and 4% respectively. Subdistrict-level data may also be more readily available and accurate than finer levels such as villages and towns (shrid).

On the other hand, the state level is too broad. State-level data is frequently used for policy advocacy and reporting to higher authorities, but it lacks the nuance needed for an amenities case study.

B. Inclusivity and Representativeness

1) Number of missing data points. As explained above, it is wiser to have less missing data in a dataset (assuming the number of represented districts is equal). The share of missing data at the shrid level is above 10%, higher than in the other available datasets.

2) The chosen main dataset, the subdistrict level, contains no missing total_population data across 5444 subdistricts and 613 districts. Based on reference [1], the data covers 5444 out of 6057 subdistricts (89%, 2018 data) and 613 out of 808 districts (76%, 2022 data), which already exceeds 75% of the area of India. See the calculation below.
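The coverage shares quoted above follow from quick arithmetic on the counts in the text:

```python
# Counts taken from the text: covered vs total (2018/2022 reference data)
sub_covered, sub_total = 5444, 6057
dist_covered, dist_total = 613, 808

print(f"subdistrict coverage: {sub_covered / sub_total:.1%}")   # 89.9%
print(f"district coverage:    {dist_covered / dist_total:.1%}")  # 75.9%
```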

C. Other Reflections using CDS Framework

  • Inequality consideration: since some subdistricts are unavailable, I would prefer a more complete database. In this case, however, the database is basically one of the most complete ones available (explained in A above). To reduce inequality and improve head-to-head comparison, I avoid combining data from modules collected at different times; hence, I will focus solely on the Antyodaya module metadata. SHP files will of course still be used, even though they come from a different module, because geographical data does not change quickly.
  • Participation and Power consideration: the data was provided by researchers from India and, luckily for us (very kind of them!), it is open source. Hence, we can use the data, without forgetting to mention the source.
  • Positionality: in this research, the author (Fahmi) takes a researcher's position, aiming to stay objective about the outcome of the results and data. The findings on which factors amplify the immunization level will hopefully represent the voice of the children and parents of Indian citizens.

[1] https://en.wikipedia.org/wiki/Administrative_divisions_of_India. Note: data from the same year was hard to retrieve; hence, I provide 2018 and 2022 census data.

In [5]:
# Here, we calculate the number of unique values in the subdistrict and district columns.
# First, we drop rows with a missing total_population value,
# because in my opinion total population cannot simply be replaced by a value such as the mean.
# It would make more sense to replace it with density x total area of each district or subdistrict,
# but that is high effort when we only need an initial glimpse.

def calculate_num_unique_values(df, column):
    # Filter out rows where 'total_population' is missing
    filtered_df = df.dropna(subset=['total_population'])
    
    # Calculate the number of unique values for the specified column
    num_unique_values = filtered_df[column].nunique()
    
    return num_unique_values

# Calculate the number of unique values for each specified column
num_unique_values_district = calculate_num_unique_values(initial_data, 'pc11_district_id')
num_unique_values_subdistrict = calculate_num_unique_values(initial_data, 'pc11_subdistrict_id')

# Print the number of unique values for each column
print("Number of unique values in pc11_district_id:", num_unique_values_district)
print("Number of unique values in pc11_subdistrict_id:", num_unique_values_subdistrict)
Number of unique values in pc11_district_id: 613
Number of unique values in pc11_subdistrict_id: 5444

Part 02: Research Formulation

In this part, the main activities are:

  1. Explanation of research background
  2. Hypothesis
  3. Variables and Construct
  4. Some IMPORTANT reflection, limitation, and notes!

Research Background

The COVID-19 pandemic reminded us of the importance of vaccines and immunization. Through immunization, our body develops an immune foundation, preventing severe illness and contributing to herd immunity. The earlier this foundation is built, the more we contribute to herd immunity; therefore, immunization for children is important. It safeguards against various potentially life-threatening diseases, and protecting this vulnerable age group is vital for long-term public health and the well-being of the community.

This research is essentially a preliminary study to find the factors driving the immunization level among children aged 0-3 years. As a starting point, several factors can influence the immunization level, mainly economic, social, and political ones. Niranjan (2018)[1] underpins that women from SHGs (Self Help Groups; community organizations that support families and women financially) with a health intervention were more likely to adopt healthier practices overall, including providing age-appropriate immunization.

Kumar (2009)[2] found that the educational level of women and their husbands, income level, and infrastructure impact women's participation in SHGs. In this research, we will assume that these also impact the level of immunization.

Research Objective

This program is part of a preliminary study, an initial step (testing the waters) towards understanding the driving factors behind the immunization level of children aged 0-3 years. In the end, this preliminary research should narrow down which variables are likely to influence the immunization level, allowing future research to focus on the important variables (while acknowledging that preliminary findings may require further investigation).

Scope of the Data

The SHRUG Mission Antyodaya Village Facilities data is a large dataset, at shrid, subdistrict, and district level, of important factors related to facilities and socio-economic features of India, collected in 2018-2020.

Some important sources:

[1] Saggurti, N., Atmavilas, Y., Porwal, A., Schooley, J., Das, R., Kande, N., Irani, L., & Hay, K. (2018). Effect of health intervention integration within women's self-help groups on collectivization and healthy practices around reproductive, maternal, neonatal and child health in rural India. PloS one, 13(8), e0202562. https://doi.org/10.1371/journal.pone.0202562

[2] Kumar (2009). Participation in Self-Help Group Activities and its Impacts: Evidence from South India. MPRA Paper No. 19943, posted 14 Jan 2010 (online at https://mpra.ub.uni-muenchen.de/19943/).

Hypothesis

To give a clear picture, the five hypotheses are listed below.


H1: The number of graduates and post graduates (representing the education level) positively impacts the level of immunized children (age 0-3 years).

H2: The number of Self Help Groups (SHGs) (representing community help) positively impacts the level of immunized children (age 0-3 years).

H3: The number of Integrated Child Development Services (ICDS) beneficiaries (representing subsidy) positively impacts the level of immunized children (age 0-3 years).

H4: The number of health facilities in an area (representing infrastructure) positively impacts the level of immunized children (age 0-3 years).

H5: The level of immunized children (age 0-3 years) is affected by neighbouring areas that also have a certain level of immunization.

Hypothesis Explanation

  1. H1: The number of graduates and post graduates (representing the education level of parents) positively impacts the level of immunized children (age 0-3 years). A higher education level improves cognitive ability and science-consciousness about the fact that immunization strengthens the body's immune system (acknowledging that this is science-based evidence). Without prior knowledge of the benefits of immunization, people are expected to be unwilling to provide it to their children.

  2. H2: The number of Self Help Groups (SHGs) (representing community help) positively impacts the level of immunized children (age 0-3 years). Self Help Groups contribute to community well-being, potentially boosting child immunization rates (age 0-3) through shared resources, awareness, and support. If a group of people immunize their children, the willingness of others to provide immunization is also expected to increase.

  3. H3: The number of Integrated Child Development Services (ICDS) beneficiaries (represents subsidy) positively impacts the level of immunized children (age 0-3 years). This hypothesis builds on the idea that subsidies offered through ICDS contribute to improved accessibility, affordability, and awareness of healthcare services, including immunization.

  4. H4: The number of health facilities in an area (representing infrastructure) positively impacts the level of immunized children (age 0-3 years). This hypothesis arises because I think a well-developed healthcare infrastructure facilitates increased immunization coverage. A higher number of health facilities per area is likely to provide better access to vaccination services, reducing barriers and encouraging parents to adhere to immunization schedules.

  5. H5: The level of immunized children (age 0-3 years) is affected by neighbouring areas that have a certain level of immunization. The immunization practices of one area (subdistrict) may impact the immunization behaviour of nearby areas. This stems from the idea that communities with higher immunization rates create a positive norm that influences neighbouring communities.
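To make H5 concrete, here is a toy, hand-rolled version of the global Moran's I statistic that Part 05 computes with esda. This sketch uses plain numpy and a four-area chain invented for illustration, not the notebook's real data:

```python
import numpy as np

def morans_i(x, W):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), with z = x - mean(x)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    S0 = W.sum()          # total weight
    n = len(x)
    return (n / S0) * (z @ W @ z) / (z @ z)

# Four areas on a line; each area's only neighbours are the adjacent ones
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

print(morans_i([1, 1, 5, 5], W))  # clustered values -> positive I
print(morans_i([1, 5, 1, 5], W))  # alternating values -> negative I
```

Values near +1 indicate that similar immunization levels cluster together (supporting H5), values near 0 indicate no spatial pattern, and negative values indicate a checkerboard-like pattern.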

Hypothesis Testing Method

  1. Via correlation and scatter plots: H1, H2, H3.
  2. Via correlation with the help of mapped amenities (data from OSMNx). The data to be plotted: areas with their percentage immunization level and health facilities (hospitals, social facilities, doctors, clinics): H4.
  3. Via ESDA: H5.
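Method 1 can be sketched with scipy.stats.pearsonr (already imported in Part 01). The two series below are invented purely for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy per-capita series (invented values): graduates share vs immunization share
grads = np.array([0.02, 0.05, 0.08, 0.11, 0.15, 0.20])
immun = np.array([0.40, 0.48, 0.55, 0.61, 0.70, 0.78])

r, p = pearsonr(grads, immun)
print(f"r = {r:.3f}, p = {p:.4f}")
# A hypothesis such as H1 survives this test only if r is positive
# AND p is below the chosen significance level (e.g. 0.05).
```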

To improve inclusivity, H4 will focus on three districts with high, average, and low immunization levels among children aged 0-3 years.

Constructs or Variables

Variables investigated here and their explanation:

| Variable | Represented by | Meaning | Unit | Type |
|----------|----------------|---------|------|------|
| Number of graduates and post graduates (per capita) | total_grd_and_pg_in_village divided by total_population | Attainment of higher education (represents awareness of parenting and health), per capita | % | independent var |
| Number of SHGs (per capita) | total_shg divided by total_population | SHG: Self Help Group, a bottom-up community organization (NGO) aiming to empower women financially and in leadership (parenting) | per capita | independent var |
| Number of ICDS beneficiaries (lactating women), per lactating woman | total_no_of_lactating_mothers_re divided by total_no_of_lactating_mothers | ICDS: Integrated Child Development Services | % | independent var |
| Total number of health facilities per area (per district) | calculated using a GeoDataFrame: amenities are fetched from OSMNx, plotted, then counted per district after mapping | Health facility density | no. per km2 | independent var |
| Number of immunized children (per child aged 0-3 years) | total_childs_aged_0_to_3_years_i divided by total_childs_aged_0_to_3_years | Percentage of children (age 0-3 yrs) with immunization | % | dependent var |
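As a minimal sketch (synthetic values, real SHRUG column names) of how the dependent variable in the last row could be derived:

```python
import pandas as pd
import numpy as np

# Synthetic rows reusing the real SHRUG column names from the table above
df = pd.DataFrame({
    "total_childs_aged_0_to_3_years":   [120.0, 80.0, 0.0],
    "total_childs_aged_0_to_3_years_i": [90.0, 60.0, 0.0],
})

# Percentage immunized; guard against division artefacts when no children
# are recorded (0/0 yields NaN, x/0 would yield inf)
df["pct_immunized"] = (
    df["total_childs_aged_0_to_3_years_i"]
      .div(df["total_childs_aged_0_to_3_years"])
      .replace([np.inf, -np.inf], np.nan)
      * 100
)
print(df["pct_immunized"].tolist())  # [75.0, 75.0, nan]
```

The independent per-capita variables in the table follow the same division pattern with their respective numerator and denominator columns.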

Reasons for choosing the variables

Most of the variables in this preliminary research were selected based on these factors:

  1. Background theory (explained above: the backbone theory that leads to each hypothesis).
  2. Availability of variables that can represent the factors in point 1.

Explanation of each variable (nos. 1-4 and 6 are independent, no. 5 is dependent):

  1. The education level is represented by the number of graduates per capita. People with a higher education level are expected to have a base of scientific and rational thinking. It is not a direct proxy for the independent variable, but it is one of the most promising education-related variables available in the Mission Antyodaya Village Facilities data.
  2. The number of SHGs (Self Help Groups) represents the help of the community in developing, mostly, women and children. The community is expected to boost awareness of community-level immunization of children.
  3. Subsidy (push from the government) is represented by the number of ICDS (Integrated Child Development Services) beneficiaries, focused on lactating women. It was chosen because the goal of the policy and subsidy is to push families towards healthier children in India. The programme targets children under 6 years of age and their mothers, so it relates to the unit of analysis of this preliminary research. During the 2018-19 fiscal year, the Indian federal government allocated ₹16,335 crore (US$2.0 billion) to the programme, 60% of its funding, while the states allocated the remaining 40% (https://en.wikipedia.org/wiki/Integrated_Child_Development_Services).
  4. The number of health facilities per area (per subdistrict) was chosen because more facilities should also raise the awareness of the people around them, assuming there are related activities. Immunization cannot be done by individuals; it should be monitored by the government and health-related facilities.
  5. Extending the background explanation, the immunization level was chosen because it cannot arise by itself inside a community (meaning some factors likely drive it).
  6. Added variable: global and local ESDA will be used to investigate the correlation between an area with a certain immunization level and its neighbours. It is based on the observation that a community will likely follow the communal behaviour around it ("social conformity" or "herd behavior", https://en.wikipedia.org/wiki/Herd_behavior).

Inclusion benefit: in this research, we can see some areas with a lower immunization level. Thus, I will choose districts (and the subdistricts inside them) that represent areas with high, average, and low immunization levels. This is possible thanks to the benefit of analyzing data from multiple districts, and this inclusion leads to more representative data across all variables (areas with high/low subsidy, etc.).

Reflection

A. Possible Other Variables that would improve the OUTPUT of the RESEARCH.

Factor: Education Level. Alternative variables:

  • Number of people trained (this is in the SHRUG dataset): I wanted to add this, but I could not determine what type of training people received (it could be very technical, instead of science-based like graduate education).
  • Number of people trained in childcare and parenting: I would really like to add this variable because it directly represents the education that affects whether parents provide immunization to their children. Children cannot come to clinics by themselves, so parents hold an important role here.

Factor: Economy. Alternative variables:

  • Income level (not available in the dataset): it would have a more direct impact on the dependent variable than the subsidy level (a higher income can mean more freedom to provide children with immunization).
  • Poverty level (so I could analyze more inclusively whether a certain area is below the poverty line).

Factor: Health Facilities. Alternative variables:

  • Number of clinics providing immunization. The current assumption is that all health facilities analyzed here can provide immunization.

B. Possible Rejecting Condition

The hypotheses explained above would be rejected under various conditions:

H1 --> Rejected if the education level does not align (low correlation) with the immunization level of children aged 0-3 years in India.

H2 --> Rejected if the presence of SHGs does not demonstrate a statistically significant correlation with increased immunization rates.

H3 --> Rejected if communities with a higher number of ICDS beneficiaries do not exhibit a statistically significant increase in immunization rates.

H4 --> Rejected if the presence of health facilities does not demonstrate a statistically significant correlation with increased immunization rates.

H5 --> Rejected if there is little correlation between an area and its neighbours (e.g., more high-immunization areas surrounding low-immunization areas).

C. Possible Bias

Possible biases are explained below:

C.1 Possible cases in which the hypothesis will be falsely rejected

H2 --> Rejected, but it should not have been: SHGs are heterogeneous (their dynamics around immunization differ across communities), and the analysis does not account for this variability.

H4 --> Rejected, but falsely so, because a certain area is sparsely settled (more empty area) and thus has very few facilities!

C.2 Possible cases in which the hypothesis will be falsely accepted

H1 --> Accepted, but falsely so, because we cannot be sure whether the degrees are science- and health-related. There is also a validity issue: holding a degree does not necessarily mean the person has the relevant cognitive capability.

H3 --> Accepted, but falsely so, because of other government subsidies directly related to immunization (combined with ICDS), for example if only ICDS beneficiaries could receive the immunization subsidy.

D. Final Reflection and LIMITATIONS OF THE RESEARCH!!

Important perspectives that I am not taking into account:

  • I am not taking into account government policy other than the ICDS subsidy (a certain area could be pushed towards 100% child immunization by the government, for instance because of good policy and a sufficient budget).
  • I am not taking into account geographical features other than the amenities of health-related facilities (this could introduce bias, such as a low immunization level because people live in mountainous or remote areas). This will be covered in future research.
  • The time dimension!! A high level of education might not have a direct impact (within 5 years) on the immunization level. Observations could also be biased because the immunization level may result from previous years' efforts that are not contained in the dataset!

Some important points from the Critical Data Science framework:

  • Inclusion and Inequality: this preliminary research (especially for H4) focuses on three areas with low, high, and average immunization levels. For H1, H2, and H3 it uses almost all the data across various areas of India, guaranteeing the representativeness of each region. The variables above are mostly normalized per capita, also to prevent the bias of populous areas (a high population likely means a high number of children, and thus a high absolute number of immunized children!)
  • Participation: NGOs, government efforts (subsidy), and mothers and children (the unit of analysis) participate as the main sources of data.
  • Power: the preliminary research includes choropleths to show the distribution of immunization levels.
  • Positionality: since this is preliminary research, the data is kept as objective as possible (using only one dataset, from SHRUG, helps prevent bias here)
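The per-capita point above can be illustrated with toy numbers (hypothetical, not from the SHRUG dataset): a populous district can lead in absolute immunized counts while trailing in actual coverage.

```python
import pandas as pd

# Hypothetical numbers, only to illustrate why we normalize per capita.
toy = pd.DataFrame({
    'district': ['A', 'B'],
    'children_0_3': [10000, 1000],   # A is ten times more populous
    'immunized_0_3': [6000, 900],
})
toy['percent_immunized'] = toy['immunized_0_3'] / toy['children_0_3'] * 100
print(toy)
# A has more immunized children in absolute terms (6000 > 900),
# yet lower coverage (60% < 90%): ranking by raw counts would mislead.
```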

E. What I made differently this time (compared to previous assignment):

  • Use more of the CDS framework! It is very useful as a checklist of possible biases and makes me realize the imperfections of preliminary research (hence I believe the level here is closer to preliminary research than to a full study).

Part 03: Data Cleaning ¶

In this part, main activities are:

  1. Data subsetting
  2. Manage Missing Data
  3. Initial Data Visualization

1. Data Subsetting

Since the variables have already been defined in Part 02, we will only take the columns related to the hypotheses (see the table in Part 02). However, there are also important columns that we should not exclude, for example: the ids of subdistrict and district (to be used for merging with the shp files) and the population (which may be used to normalize the numbers / get percentages).

In [6]:
selected_columns = ['pc11_state_id','pc11_district_id','pc11_subdistrict_id',
                    'total_population','male_population','female_population','total_hhd',
                    'total_grd_and_pg_in_village','total_shg',
                    'total_no_of_lactating_mothers','total_no_of_lactating_mothers_re',
                    'total_childs_aged_0_to_3_years','total_childs_aged_0_to_3_years_i'
                    ]

# Select only the columns in the 'selected_columns' array
subset_df = initial_data[selected_columns]

2. Manage Missing Data

First, we need to see which indices are NaN.

In [7]:
# Basically a "for loop": check every column for NA values
for x in selected_columns:
    if subset_df[x].isna().any():
        na_loc = subset_df.index[subset_df[x].isna()].tolist()
        print(f"{x} with total {len(na_loc)} NaN data ({round(len(na_loc)/len(subset_df[x])*100,ndigits=2)}%), at index {na_loc}")
total_population with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
male_population with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
female_population with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
total_hhd with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
total_grd_and_pg_in_village with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
total_shg with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
total_no_of_lactating_mothers with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
total_no_of_lactating_mothers_re with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
total_childs_aged_0_to_3_years with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
total_childs_aged_0_to_3_years_i with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]

We can see that all columns have NA at the same 36 indices. Next, we use our judgment to decide whether to delete or replace the NA values.

Here are some ideas that I use for this case:

  1. If the SHG count of an area is NA, we should not guess or replace it, because zero is a plausible true value: it is possible that an area simply has no SHG at all. Furthermore, the NA rows account for only 0.66% of the data, so dropping them is insignificant.
  2. We should also not guess the population of a subdistrict, because replacing the NA would need a further method (that is, calculating the area of the district and multiplying it by density), which requires more effort than dropping the rows (again, only 0.66%!)
  3. The same reasoning applies to the variables below, because they may be needed later for per-capita calculations (a NaN or zero value would just yield NA or 0/0). The analysis is not applicable if an area has no children or no mothers at all.
  • Total number of children aged 0-3 years
  • Total number of lactating mothers.
In [8]:
# Remove NA value of point 1 and 2 above
columns_to_remove_na = ['total_shg', 'total_population']

# Drop rows with NA values in those columns
subset_df = subset_df.dropna(subset=columns_to_remove_na)

Also, since we will create new per-capita columns, 0/0 values are possible. We therefore need to drop rows with a possible "/0" denominator.

In [9]:
columns_to_remove_zero = ['total_no_of_lactating_mothers']

# Drop rows where specified columns contain only 0
subset_df = subset_df.loc[~(subset_df[columns_to_remove_zero] == 0).all(axis=1)]

# DO the same for total child ages 0-3 yrs
columns_to_remove_zero = ['total_childs_aged_0_to_3_years']
subset_df = subset_df.loc[~(subset_df[columns_to_remove_zero] == 0).all(axis=1)]

After the NA data is all gone, we can do some calculations (adding important variables). We need to create new variables so the comparison is more inclusive and fair (otherwise a case is possible where the number of immunized children is higher in a certain area simply because it has more children!).

List of new variables:

  1. Percentage of immunized children = total number of immunized children (0-3 years) / total number of children aged 0-3 years
  2. Percentage of people with higher education = total number of graduates (or postgrads) / total population
  3. SHGs per capita = total number of SHGs / total population
  4. Percentage of lactating mothers receiving ICDS = number of lactating mothers receiving ICDS services / number of lactating mothers
In [10]:
# No 01
subset_df['percent_immunized_child'] = (subset_df['total_childs_aged_0_to_3_years_i'] / subset_df['total_childs_aged_0_to_3_years']).round(3) * 100

# No 02
subset_df['percent_grd'] = (subset_df['total_grd_and_pg_in_village'] / subset_df['total_population']).round(3) * 100

# No 03
subset_df['percent_shg'] = (subset_df['total_shg'] / subset_df['total_population']).round(5)

# No 04
subset_df['percent_icds_women'] = (subset_df['total_no_of_lactating_mothers_re'] / subset_df['total_no_of_lactating_mothers']).round(3) * 100

Finally, check whether any NaN values remain.

In [11]:
# Recheck if there is NA value again
column_where_na = []
row_where_na = []

# Select all columns inside the subset_df DataFrame
all_columns = subset_df.columns

# For each column: if it contains NA, record the column and the row indices in the empty arrays above.
for col in all_columns:
    if subset_df[col].isna().any():
        na_loc = subset_df.index[subset_df[col].isna()].tolist()
        print(f"{col}: NaN at index {na_loc}")
        column_where_na.append(col)
        row_where_na.extend(na_loc)

if not column_where_na:
    print("No NaN values found. Great!")
else:
    print("There are NaN values. Please check.")
    print("Columns with NaN values:", column_where_na)
    print("Rows with NaN values:", row_where_na)
No NaN values found. Great!

3. Initial Visualization

AT FIRST, we will focus on the absolute numbers (not the per-capita numbers) of the variables to get a global overview.

Histograms are used to see the profile of each variable, mainly how concentrated the data is. They are also useful later for the "Quantiles" stage.

In [12]:
# Define which columns to plot as histograms
columns_hist = [
    'total_grd_and_pg_in_village',
    'total_shg',
    'total_no_of_lactating_mothers_re',
    'total_childs_aged_0_to_3_years_i'
]

# Then, I set the number of columns and rows for the layout
num_cols = 2
num_rows = 2

# Set up subplots for each column
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 6))
axes = axes.flatten()

# Create histograms in each of the columns
for i, column in enumerate(columns_hist):
    axes[i].hist(subset_df[column].dropna(), bins=20, color='blue', alpha=0.7)
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')

    if column == 'total_grd_and_pg_in_village':
        axes[i].set_title('Number of Graduates and Postgrads')
        axes[i].set_xlim(0, 50000)  # Set x-limit from 0 to 50000

    if column == 'total_shg':
        axes[i].set_title('Number of SHGs')
        axes[i].set_xlim(0, 8000)  # Set x-limit from 0 to 8000

    if column == 'total_no_of_lactating_mothers_re':
        axes[i].set_title('Number of Lactating Mothers Receiving ICDS')
        axes[i].set_xlim(0, 12000)  # Set x-limit from 0 to 12000

    if column == 'total_childs_aged_0_to_3_years_i':
        axes[i].set_title('Number of Children Age 0-3 Years with Immunization')
        axes[i].set_xlim(0, 40000)  # Set x-limit from 0 to 40000

# Add a single title for all subplots
fig.suptitle('Histogram of Socioeconomic Key Variables in India', fontsize=16)
plt.tight_layout()
plt.show()

Graphical Excellence

  1. Wise use of space: previously the value axis spanned very far (because of outliers), obscuring the shape of the distribution, hence I use xlim to focus on the important range.
  2. Clear purpose and clear labels on the axes and title (I use the true variable names, not the dataframe column names). The location name is also included, to avoid the ambiguity of "where is this place?"
  3. Not distorting the data: modifications are minimal so the bars appear naturally.

Next, we want to test hypotheses H1, H2, and H3. To test them, Pearson correlation and a heat map are used. This also makes sure that a correlation exists between the independent and dependent variables of each hypothesis.

The plot is completed with a linear regression line and confidence interval to show the spread of the data.

In [13]:
# Declaring which columns we want to include in the correlation
columns_corr = [
    'total_childs_aged_0_to_3_years_i',
    'total_grd_and_pg_in_village',
    'total_shg',
    'total_no_of_lactating_mothers_re'
]

# Create a subset dataframe with only the specified columns as specified above
subset_corr_df = subset_df[columns_corr]

# Change the name of the variable so that it is understandable by HUMAN!
subset_corr_df.columns = ['No. of Childs (0-3) Immunized','Education Level', 'No. of SHGs', 'No. of Lact. Mothers with ICDS']

# Create a pair plot
pair_plot = sns.pairplot(subset_corr_df, kind='scatter', diag_kind='hist')

# Change the colour to black (scatter plot)
# (use numpy's triu_indices_from directly; the old `plt.np` shortcut was removed from Matplotlib)
for i, j in zip(*np.triu_indices_from(pair_plot.axes, 1)):
    sns.scatterplot(x=subset_corr_df.columns[j], y=subset_corr_df.columns[i], data=subset_corr_df, color='black', ax=pair_plot.axes[i, j])

# Add orange linear regression lines with confidence intervals
for i, j in zip(*np.triu_indices_from(pair_plot.axes, 1)):
    sns.regplot(x=subset_corr_df.columns[j], y=subset_corr_df.columns[i], data=subset_corr_df, scatter=False, color='orange', ax=pair_plot.axes[i, j])

# Add title
pair_plot.fig.suptitle('Pair Plot of Key Variables regarding Immunization Level of Childs (0-3) in India', y=1.02, fontsize=16)

# Show the plot
plt.show()

Graphical Excellence

  1. Wise use of space: the pairplot layout is compact and efficient
  2. Clear purpose and clear labels on the axes and title (I use the true variable names, not the dataframe column names). The location name is also included, to avoid the ambiguity of "where is this place?"
  3. Not distorting the data: modifications to the plot are minimal
  4. Colours are chosen to be contrasting and colour-blind friendly: black and orange give good contrast (blue versus red would be too harsh a pairing!)

From the pairplot above, we can observe that the absolute (not per-capita) number of immunized children (0-3 years) correlates with the variables: Education Level, No. of SHGs, and No. of Lactating Mothers Receiving ICDS. To make it clearer, we can also look at the heatmap.

In [14]:
# Create a heatmap of the correlation matrix
correlation_matrix = subset_corr_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)

# Show the plot
plt.title('Heatmap of Socio-economic Variables related to Immunization of Childs (0-3) in India', fontsize=12)
plt.show()

Graph Excellence

  1. Wise use of space
  2. Clear purpose and clear labels on the axes and title (I use the true variable names, not the dataframe column names). The location name is also included, to avoid the ambiguity of "where is this place?"
  3. Not distorting the data: modifications to the plot are minimal
  4. Encourages the eye, using colour, to understand the data (the main purpose of a heatmap)

Some important points that we get from the heatmap above:

  1. No. of SHGs has low correlation (compared to the other variables), making it a good candidate independent variable.
  2. No. of Children (0-3) with Immunization has a strong correlation with No. of Lactating Mothers receiving the ICDS subsidy.
  3. Other variable pairs also show relatively high correlation (Education Level vs No. of Lactating Mothers receiving ICDS), which is quite strange. Further evaluation is needed!

Well, we found that some independent variables correlate with other independent variables (so they are not excellent candidates for independent variables). Next, we get more direct about the hypotheses.
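One standard way to quantify this kind of correlation among predictors (not used in this notebook; a minimal sketch on synthetic data, with a hand-rolled helper `vif`) is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([others, np.ones(len(y))])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Toy data: x2 is almost a copy of x1, x3 is independent noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.05 * rng.normal(size=500)
x3 = rng.normal(size=500)
print([round(v, 1) for v in vif(np.column_stack([x1, x2, x3]))])
# x1 and x2 get very large VIFs (severe multicollinearity); x3 stays near 1.
```

Values above roughly 5-10 are commonly read as a multicollinearity warning; `statsmodels` also ships a ready-made `variance_inflation_factor` if a library routine is preferred.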

Let us check all the H1, H2, and H3 variables in their per-capita form (more inclusive and fair).

In [15]:
# Define which variables we want to observe
columns_name = [
    'percent_immunized_child',
    'percent_grd',
    'percent_shg',
    'percent_icds_women',
]
In [16]:
# Focus on dependent vs all independent variables on H1-H3
subset_corr_df = subset_df[columns_name]
subset_corr_df.columns = ['No. of Childs (0-3) Immunized','Education Level', 'No. of SHGs', 'No. of Lact. Mothers with ICDS']

# Create a pair plot with 'percent_immunized_child' on the y-axis and others on the x-axis
pair_plot = sns.pairplot(subset_corr_df, y_vars=['No. of Childs (0-3) Immunized'], x_vars=subset_corr_df.columns[1:], kind='scatter', diag_kind='hist')

# Show the plot
pair_plot.fig.suptitle('Pair Plot of Key Variables regarding Immunization Level of Childs (0-3) in India', y=1.12, fontsize=12)
plt.show()

Principle of excellence:

  1. Wise use of space: the pairplot layout is compact and efficient
  2. Clear purpose and clear labels on the axes and title (I use the true variable names, not the dataframe column names). The location name is also included, to avoid the ambiguity of "where is this place?"
  3. The best graph type is selected (a scatter plot, to see how the data points spread)
In [17]:
# Create a heatmap of the correlation matrix
correlation_matrix = subset_corr_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)

# Show the plot
plt.title('Heatmap of Socio-economic Variables (per capita) related to Immunization of Childs (0-3) in India', fontsize=12)
plt.show()

Graph Excellence

  1. Wise use of space
  2. Clear purpose and clear labels on the axes and title (I use the true variable names, not the dataframe column names). The location name is also included, to avoid the ambiguity of "where is this place?"
  3. Not distorting the data: modifications to the plot are minimal
  4. Encourages the eye, using colour, to understand the data (the main purpose of a heatmap)

To make sure of the significance, we also calculate p-values. A low p-value (typically under 5%) indicates a significant correlation between the variables. The null hypothesis of a correlation test typically assumes there is no correlation between the variables.

In [18]:
# Calculate correlation matrix with p-values
corr_matrix = subset_corr_df.corr()
p_values = np.zeros_like(corr_matrix)

for i in range(corr_matrix.shape[0]):
    for j in range(corr_matrix.shape[1]):
        _, p_value = pearsonr(subset_corr_df.iloc[:, i], subset_corr_df.iloc[:, j])
        p_values[i, j] = p_value

# Create a list of p-values
p_values_list = []

for i in range(corr_matrix.shape[0]):
    for j in range(corr_matrix.shape[1]):
        p_values_list.append((subset_corr_df.columns[i], subset_corr_df.columns[j], p_values[i, j]))

# Print the list of p-values
for p_value_info in p_values_list:
    print(f"P-value between {p_value_info[0]} and {p_value_info[1]}: {p_value_info[2]:.3f}")
P-value between No. of Childs (0-3) Immunized and No. of Childs (0-3) Immunized: 0.000
P-value between No. of Childs (0-3) Immunized and Education Level: 0.000
P-value between No. of Childs (0-3) Immunized and No. of SHGs: 0.000
P-value between No. of Childs (0-3) Immunized and No. of Lact. Mothers with ICDS: 0.000
P-value between Education Level and No. of Childs (0-3) Immunized: 0.000
P-value between Education Level and Education Level: 0.000
P-value between Education Level and No. of SHGs: 0.000
P-value between Education Level and No. of Lact. Mothers with ICDS: 0.000
P-value between No. of SHGs and No. of Childs (0-3) Immunized: 0.000
P-value between No. of SHGs and Education Level: 0.000
P-value between No. of SHGs and No. of SHGs: 0.000
P-value between No. of SHGs and No. of Lact. Mothers with ICDS: 0.000
P-value between No. of Lact. Mothers with ICDS and No. of Childs (0-3) Immunized: 0.000
P-value between No. of Lact. Mothers with ICDS and Education Level: 0.000
P-value between No. of Lact. Mothers with ICDS and No. of SHGs: 0.000
P-value between No. of Lact. Mothers with ICDS and No. of Lact. Mothers with ICDS: 0.000

Reflection

Based on the provided p-values:

H1: Between "No. of Childs (0-3) Immunized" and "Education Level": p-value = 0.000 (significant)
H2: Between "No. of Childs (0-3) Immunized" and "No. of SHGs": p-value = 0.000 (significant)
H3: Between "No. of Childs (0-3) Immunized" and "No. of Lact. Mothers with ICDS": p-value = 0.000 (significant)

These small p-values support the idea that there is a statistically significant positive correlation between the number of immunized children (age 0-3 years) and each of the variables: "Education Level," "No. of SHGs," and "No. of Lact. Mothers with ICDS." This aligns with the positive impacts suggested by hypotheses H1, H2, and H3.

However, some independent variables correlate with other independent variables! We will need to evaluate this in the next research because of a possibility of bias (for example, some variables treated as independent might turn out to be dependent). For now, we will assume H1-H3 hold based on the significance above.
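One way to probe this in a follow-up (not part of the analysis above; a minimal numpy sketch on synthetic data, with a hand-rolled `partial_corr` helper) is partial correlation: correlate two variables after regressing a third, controlling variable out of both.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after regressing out control z (1-D arrays)."""
    zx = np.polyfit(z, x, 1)
    zy = np.polyfit(z, y, 1)
    rx = x - np.polyval(zx, z)  # residuals of x after removing a linear fit on z
    ry = y - np.polyval(zy, z)  # residuals of y after removing a linear fit on z
    return np.corrcoef(rx, ry)[0, 1]

# Toy check: x and y are linked only through z, so the partial correlation ~ 0.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.1 * rng.normal(size=2000)
y = z + 0.1 * rng.normal(size=2000)
print(round(np.corrcoef(x, y)[0, 1], 2))   # raw correlation is high
print(round(partial_corr(x, y, z), 2))     # near zero once z is controlled for
```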


Critical DS Framework Reflection

  1. Inclusivity: inclusivity has already been taken into account in this analysis by using per-capita values and India-wide data.
  2. Biased data: H1-H3 are significant; however, the heatmap showed that some independent variables are not purely independent!

Let us discuss further in the next preliminary research, because these cases need more expertise (my current knowledge is lacking, I need Trivik :) )

Next, we will try to test H4 and H5.

For H4, we need OSMNx AMENITIES to get new variable "Number of Health Facilities in an Area"!


Part 04: Spatial Visualization ¶

In this part, main activities are:

  1. Merging shp and csv files
  2. Visualize the map, highlight each of the important variables on the map
  3. Create Choropleth

1. Merging shp and csv files

Here, we will merge the shp and csv files. The key indices that join them are "pc11_subdistrict_id" and "pc11_district_id". Since our analysis focuses on the subdistrict level, the files below will be combined:

  • Subset data from steps 01 and 02 above, at subdistrict level (why subdistrict level was chosen is explained above)
  • SHP file on subdistrict, from the SHRUG website.
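Before the actual merge, it can be worth checking how well the keys line up. A small sketch with toy ids (hypothetical values, not real pc11 codes) using pandas' `indicator` option, which flags rows that exist on only one side and would be silently dropped by the default inner merge:

```python
import pandas as pd

# Toy keys only; the real merge uses pc11_subdistrict_id from both files.
left = pd.DataFrame({'pc11_subdistrict_id': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'pc11_subdistrict_id': [2.0, 3.0, 4.0], 'val': [10, 20, 30]})

check = left.merge(right, on='pc11_subdistrict_id', how='outer', indicator=True)
print(check['_merge'].value_counts())
# 'left_only' rows have no attribute match; 'right_only' rows have no geometry.
```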
In [19]:
map_path = './data/shrug-pc11subdist-poly-gpkg/subdistrict.gpkg'
gdf = gpd.read_file(map_path)  # read straight into gdf (avoids shadowing Python's built-in `map`)
In [20]:
# Merge the subset dataframe (subset_df) with the geodataframe (gdf)
# But first, let's examine the column types
gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 5969 entries, 0 to 5968
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   pc11_state_id        5969 non-null   object  
 1   pc11_district_id     5969 non-null   object  
 2   pc11_subdistrict_id  5969 non-null   object  
 3   subdistrict_name     5966 non-null   object  
 4   geometry             5969 non-null   geometry
dtypes: geometry(1), object(4)
memory usage: 233.3+ KB
In [21]:
# and the subset_df
subset_df.info()
subset_df.columns
<class 'pandas.core.frame.DataFrame'>
Index: 5420 entries, 1 to 5479
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   pc11_state_id                     5420 non-null   int64  
 1   pc11_district_id                  5420 non-null   int64  
 2   pc11_subdistrict_id               5420 non-null   int64  
 3   total_population                  5420 non-null   float64
 4   male_population                   5420 non-null   float64
 5   female_population                 5420 non-null   float64
 6   total_hhd                         5420 non-null   float64
 7   total_grd_and_pg_in_village       5420 non-null   float64
 8   total_shg                         5420 non-null   float64
 9   total_no_of_lactating_mothers     5420 non-null   float64
 10  total_no_of_lactating_mothers_re  5420 non-null   float64
 11  total_childs_aged_0_to_3_years    5420 non-null   float64
 12  total_childs_aged_0_to_3_years_i  5420 non-null   float64
 13  percent_immunized_child           5420 non-null   float64
 14  percent_grd                       5420 non-null   float64
 15  percent_shg                       5420 non-null   float64
 16  percent_icds_women                5420 non-null   float64
dtypes: float64(14), int64(3)
memory usage: 762.2 KB
Out[21]:
Index(['pc11_state_id', 'pc11_district_id', 'pc11_subdistrict_id',
       'total_population', 'male_population', 'female_population', 'total_hhd',
       'total_grd_and_pg_in_village', 'total_shg',
       'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
       'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i',
       'percent_immunized_child', 'percent_grd', 'percent_shg',
       'percent_icds_women'],
      dtype='object')

As you can see, the column 'pc11_subdistrict_id' is stored as object in the GDF, while in subset_df (from Part 02) it is stored as int. We will convert both to float, in case we need to merge other data in the future: float is more 'flexible' than integer (for example, it can hold decimals).

In [22]:
gdf.pc11_subdistrict_id = gdf.pc11_subdistrict_id.astype(float)
subset_df.pc11_subdistrict_id = subset_df.pc11_subdistrict_id.astype(float)
In [23]:
# Name the merged variable as geo_pop
geo_pop = gdf.merge(subset_df, on='pc11_subdistrict_id')
geo_pop.head()
Out[23]:
pc11_state_id_x pc11_district_id_x pc11_subdistrict_id subdistrict_name geometry pc11_state_id_y pc11_district_id_y total_population male_population female_population ... total_grd_and_pg_in_village total_shg total_no_of_lactating_mothers total_no_of_lactating_mothers_re total_childs_aged_0_to_3_years total_childs_aged_0_to_3_years_i percent_immunized_child percent_grd percent_shg percent_icds_women
0 24 468 3722.0 Lakhpat MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... 24 468 62117.708950 32088.750366 30025.930963 ... 394.599993 566.165208 227.071608 204.869050 3082.118619 2619.901747 85.0 0.6 0.00911 90.2
1 24 468 3723.0 Rapar MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... 24 468 201902.906291 105105.320376 96796.567185 ... 2531.541403 339.236735 930.099517 861.844679 6842.802254 4941.854062 72.2 1.3 0.00168 92.7
2 24 468 3724.0 Bhachau MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... 24 468 176504.000000 92678.000000 83460.000000 ... 2577.000000 234.000000 952.000000 790.000000 6897.000000 4373.000000 63.4 1.5 0.00133 83.0
3 24 468 3725.0 Anjar POLYGON ((70.23631 23.40849, 70.23631 23.40545... 24 468 150105.112093 79896.579777 70092.396031 ... 3643.775958 491.505351 1147.882751 1027.598741 5673.050161 4004.627988 70.6 2.4 0.00327 89.5
4 24 468 3726.0 Bhuj POLYGON ((69.78433 23.99110, 69.79143 23.98750... 24 468 238834.277866 119211.778880 113669.738389 ... 2819.277177 230.560895 629.484862 570.504168 7436.929336 5231.051377 70.3 1.2 0.00097 90.6

5 rows × 21 columns

In [24]:
# Because pc11_state_id_x = pc11_state_id_y and pc11_district_id_x = pc11_district_id_y,
# we will delete the _y columns
# and rename pc11_state_id_x to pc11_state_id and pc11_district_id_x to pc11_district_id

geo_pop = geo_pop.drop(['pc11_state_id_y', 'pc11_district_id_y'], axis=1)

# Rename columns
geo_pop = geo_pop.rename(columns={'pc11_state_id_x': 'pc11_state_id', 'pc11_district_id_x': 'pc11_district_id'})
In [25]:
# Check whether the CRS got lost. Turns out it did not: the data still uses EPSG:4326. Seems fine, right?? :D
geo_pop.crs
Out[25]:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Below, we will also merge "district.gpkg" with "subset_df". This is part of the "non-linear" process: in some code further below, when determining the "spatial weight", there were far too many islands (when using the geo_pop above)! I think this comes from using the "subdistrict.gpkg" data. So, for convenience and better data quality, I will also merge "district.gpkg" with "subset_df".

When islands are not surrounded by similar values, they can disrupt the spatial autocorrelation patterns that are essential for many spatial analysis techniques.
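The "island" problem can be spotted directly from the geometries: a polygon is an island if it intersects no other polygon. A toy sketch with shapely (hypothetical unit squares, not the SHRUG geometries):

```python
from shapely.geometry import Polygon

# Two adjacent unit squares plus one far-away square (the "island").
polys = [
    Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    Polygon([(5, 5), (6, 5), (6, 6), (5, 6)]),
]

# An island intersects (touches or overlaps) none of the other polygons.
islands = [i for i, g in enumerate(polys)
           if not any(g.intersects(h) for j, h in enumerate(polys) if j != i)]
print(islands)  # → [2]
```

With libpysal, the same diagnosis comes for free: after building a contiguity weights object `w`, the attribute `w.islands` lists the disconnected observations.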

In [26]:
mapdist_path = './data/shrug-pc11dist-poly-gpkg/district.gpkg'
map_dist = gpd.read_file(mapdist_path)

Since the GPKG data above is at district level, while "subset_df" is at subdistrict level, we need to create a pivot containing the sum of all the columns (total population, total children, etc.)

In [27]:
columns_to_sum = ['total_population', 'male_population', 'female_population', 'total_hhd',
                  'total_grd_and_pg_in_village', 'total_shg',
                  'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
                  'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i']

# Create a pivot table
pivot_table_df = subset_df.pivot_table(values=columns_to_sum, index='pc11_district_id', aggfunc='sum')

pivot_table_df
Out[27]:
female_population male_population total_childs_aged_0_to_3_years total_childs_aged_0_to_3_years_i total_grd_and_pg_in_village total_hhd total_no_of_lactating_mothers total_no_of_lactating_mothers_re total_population total_shg
pc11_district_id
1 23523.138329 26510.879626 2761.074569 2590.694210 704.340235 8981.478930 505.056064 479.194760 5.004162e+04 1.521253
2 360076.838747 392652.138478 29465.549515 19313.974815 33116.408459 124419.391360 4322.707964 3710.070887 7.551146e+05 1291.782192
3 45598.474752 47615.448440 2837.299473 2422.499434 3636.436542 16684.780804 791.205308 756.308394 9.321936e+04 350.046346
4 78150.007679 78037.294461 7344.751470 5050.990314 4134.530494 23630.458275 1106.910964 961.424367 1.760563e+05 653.763285
5 230263.244779 240528.135141 18368.934333 11508.827636 11365.975014 92762.747369 5829.314660 4037.745077 4.717603e+05 1120.292371
... ... ... ... ... ... ... ... ... ... ...
632 450007.906654 447742.145346 18915.735027 15588.340247 40968.391785 289061.923978 4448.443895 4225.381399 9.028997e+05 4561.162494
633 596281.419949 595956.683533 23808.052542 19714.351280 67835.559509 391963.629490 4917.232443 4590.071292 1.195993e+06 4454.343346
638 3154.866472 3307.155473 257.055194 213.852640 70.204149 1883.631334 22.681341 22.681341 6.462022e+03 35.642107
639 54314.627363 57265.113737 3669.293400 3063.362231 5526.490359 29831.195841 662.187411 636.354232 1.116362e+05 743.045473
640 53502.795148 58770.591779 3692.091661 3419.911097 4963.049693 26783.402756 739.984115 678.350943 1.123320e+05 725.685726

613 rows × 10 columns

Merge the two datasets (GPKG and pivot_table) using the same steps as for the subdistrict above.

In [28]:
# Convert to astype
map_dist.pc11_district_id = map_dist.pc11_district_id.astype(float)
geo_pop.pc11_district_id = geo_pop.pc11_district_id.astype(float)
pivot_table_df.index = pivot_table_df.index.astype(float)

# Next, merge using the indices
geo_pop_district = map_dist.merge(pivot_table_df, left_on='pc11_district_id', right_index=True)
geo_pop_district.head()
Out[28]:
pc11_state_id pc11_district_id district_name geometry female_population male_population total_childs_aged_0_to_3_years total_childs_aged_0_to_3_years_i total_grd_and_pg_in_village total_hhd total_no_of_lactating_mothers total_no_of_lactating_mothers_re total_population total_shg
0 24 468.0 Kachchh MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... 6.918732e+05 7.659721e+05 52789.062307 36124.140136 29509.874064 332446.747616 7194.496554 6452.874545 1.465846e+06 5262.200653
1 24 469.0 Banas Kantha MULTIPOLYGON (((71.24964 24.20926, 71.24207 24... 1.490841e+06 1.603543e+06 137018.185689 93865.516320 103408.588403 607873.050328 19340.773617 16878.197941 3.273774e+06 13546.601715
2 24 470.0 Patan MULTIPOLYGON (((71.42507 23.96967, 71.42497 23... 5.204153e+05 5.651527e+05 49457.172519 30594.736482 64222.702638 258319.762048 9340.013570 7797.987739 1.089937e+06 14351.448633
3 24 471.0 Mahesana POLYGON ((72.79975 24.07615, 72.80022 24.07529... 7.639744e+05 8.327545e+05 59425.341078 49149.072387 123971.352675 380288.478687 8296.862992 6994.736144 1.617866e+06 15483.493324
4 24 472.0 Sabar Kantha POLYGON ((73.14784 24.47759, 73.14773 24.47410... 1.096367e+06 1.183334e+06 87560.815945 66651.606962 152636.838216 582154.513255 16705.055796 14374.734389 2.320364e+06 20658.175179

Also, do not forget to calculate the variables needed for the hypotheses!

In [29]:
# Variable 1: % of children aged 0-3 years who are immunized
geo_pop_district['percent_immunized_child'] = (geo_pop_district['total_childs_aged_0_to_3_years_i'] / geo_pop_district['total_childs_aged_0_to_3_years']).round(3) * 100

# Variable 2: graduates and postgraduates in villages as % of total population
geo_pop_district['percent_grd'] = (geo_pop_district['total_grd_and_pg_in_village'] / geo_pop_district['total_population']).round(3) * 100

# Variable 3: self-help groups (SHGs) per capita (kept as a ratio, not scaled to 100)
geo_pop_district['percent_shg'] = (geo_pop_district['total_shg'] / geo_pop_district['total_population']).round(5)

# Variable 4: % of lactating mothers registered under ICDS
geo_pop_district['percent_icds_women'] = (geo_pop_district['total_no_of_lactating_mothers_re'] / geo_pop_district['total_no_of_lactating_mothers']).round(3) * 100

Finally, it is also a good idea to add "district_name" to the "geo_pop" data, because it only contains subdistrict names. We can then look it up (VLOOKUP-style) from "geo_pop_district" with a merge.

In [30]:
geo_pop_district.pc11_district_id = geo_pop_district.pc11_district_id.astype(float)
pivot_table_df.index = pivot_table_df.index.astype(float)

# Merge based on pc11_district_id
geo_pop = pd.merge(geo_pop, geo_pop_district[['pc11_district_id', 'district_name']], on='pc11_district_id', how='left')

2. Spatial Visualization

Let us see a glimpse of what India looks like!

In [31]:
# Setup figure and axis
f, ax = plt.subplots(1)
# Add layer of polygons on the axis
gdf.plot(ax=ax)
# Add figure title
f.suptitle('India Map')
# Display
plt.axis('equal')
# Add labels to axes
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()

Graphical Excellence

  1. Wise use of space: the figure clearly depicts the whole of India.
  2. Clear purpose and clearly labelled axes and title (latitude, longitude)
  3. No distortion of the data: plt.axis('equal') preserves the aspect ratio

We can also add a basemap from contextily (ctx)! It makes the figure more attractive, and you can easily picture it on top of an atlas map.

In [32]:
# Add a contextily (ctx) basemap so the polygons sit on top of an atlas map
# Load local Authority geometries using pc11_subdistrict_id code as index
map = gpd.read_file(map_path).set_index('pc11_subdistrict_id', drop=False)
ax = map.plot(figsize=(6, 6), alpha=0.3, color='red');
# Add background map, expressing target CRS so the basemap can be
# reprojected (warped)
ctx.add_basemap(ax, crs=map.crs)
ax.set_title('Map of India')
Out[32]:
Text(0.5, 1.0, 'Map of India')

Graphical Excellence

  1. Wise use of space: the figure clearly depicts the whole of India.
  2. Clear purpose and clearly labelled title
  3. Explores several ways to display the data: the OSM basemap helps the audience place India relative to other countries.
  4. Reveals the data in a broad view.

Let us see whether, geographically, the top 1000 subdistricts (roughly the top 20%) on each of our variables are close to each other. To be investigated: immunization of children vs. lactating mothers under ICDS (total_no_of_lactating_mothers_re).

The purpose of this cut-off of 1000 is to give a glimpse (like head and tail) of the dataset.

In [33]:
number_cut = 1000
highest_imun = geo_pop.sort_values('percent_immunized_child',ascending=False).head(number_cut)
highest_shgs = geo_pop.sort_values('percent_shg',ascending=False).head(number_cut)
highest_grd = geo_pop.sort_values('percent_grd',ascending=False).head(number_cut)
highest_ICDS = geo_pop.sort_values('percent_icds_women',ascending=False).head(number_cut)
In [73]:
f, ax = plt.subplots(1, figsize=(6, 6))

background = geo_pop.plot(facecolor='blue', linewidth=0.025, ax=ax, alpha=0.2)
highest_ICDS.plot(alpha=0.9, facecolor='green', linewidth=0, ax=ax)
highest_imun.plot(alpha=0.7, facecolor='red', linewidth=0, ax=ax)

ax.set_axis_off()
f.suptitle('1000 Subdistricts with Highest % of Lactating Mothers (ICDS) and Immunized Children in India')

# Create custom legend entries
legend_elements = [
    Line2D([0], [0], color='blue', alpha=0.2, lw=2, label='All Areas'),
    Line2D([0], [0], marker='o', color='green', label='Highest % Lact. Mother w ICDS'),
    Line2D([0], [0], marker='o', color='red', label='Highest % Child w Immunization'),
]
ax.legend(handles=legend_elements)
plt.axis('equal')
plt.show()

Graphical Excellence

  1. Wise use of space: the figure clearly depicts the whole of India.
  2. Clear purpose and clear legend
  3. Explores several ways to display the data: a background layer plus highlighted subsets.
  4. Reveals the data in a broad view together with more granular detail.

From the picture above, we can see that the most strongly correlated variables (shown in the earlier heatmap) also overlap on the map: subdistricts with a high share of lactating mothers receiving ICDS tend to lie near subdistricts with high levels of children's immunization.

We will see this more clearly using choropleths.

3. Create Choropleths

The choropleths I will make are based on the "Quantiles" scheme. I chose this because, as you can see in Part 02, most of the variables are skewed, more or less resembling a Poisson distribution with a low lambda. Quantile classification is suitable for skewed distributions and helps handle outliers: it provides a good representation by placing an approximately equal number of observations in each interval.
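To illustrate why quantile classification suits skewed data, here is a minimal self-contained sketch on synthetic right-skewed values (the exponential sample and the `y`/`breaks` names are made up for illustration, not the immunization variable): quantile breaks place an approximately equal number of observations in each class no matter how skewed the values are.

```python
import numpy as np

# Synthetic right-skewed data standing in for a skewed variable
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=1000)

k = 5  # number of classes
# Interior break points at the 20th, 40th, 60th and 80th percentiles
breaks = np.quantile(y, [i / k for i in range(1, k)])

# Assign each observation to a class and count the class sizes
counts = np.bincount(np.digitize(y, breaks))
print(counts)  # each class holds roughly 1000/k observations
```

This is the same principle `mapclassify.Quantiles` applies to the real variable below, just reduced to plain NumPy.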

In [35]:
num_division = 11 # Number of quantile classes
classi = mapclassify.Quantiles(geo_pop['percent_immunized_child'], k=num_division)
classi
Out[35]:
Quantiles

    Interval       Count
------------------------
[  0.00,  47.30] |   497
( 47.30,  57.12] |   494
( 57.12,  63.40] |   490
( 63.40,  68.20] |   511
( 68.20,  72.50] |   490
( 72.50,  76.20] |   498
( 76.20,  80.10] |   492
( 80.10,  83.70] |   494
( 83.70,  87.60] |   498
( 87.60,  92.50] |   491
( 92.50, 100.00] |   492
In [36]:
# Set up the figure to explain the Quantiles that I made
f, ax = plt.subplots(1, figsize=(8, 5))
# Plot the kernel density estimate (KDE); `fill=True` replaces the deprecated `shade` argument
sns.kdeplot(geo_pop['percent_immunized_child'], fill=True)
# Add a tick for every value at the bottom of the plot (rug)
sns.rugplot(geo_pop['percent_immunized_child'], alpha=0.5)
# Loop over each break point and plot a vertical red line
for cut in classi.bins:
    plt.axvline(cut, color='red', linewidth=0.75)
# Display image
ax.set_title('KDE for Immunized Children (0-3) Variable')    
plt.show()

Graph Excellence

  1. Wise use of space
  2. Clear purpose and clearly labelled axes and title
  3. No distortion of the data: minimal modification, so the distribution is shown naturally
  4. Guides the eye with the KDE curve
In [37]:
# Create Choropleths using the 'cividis' colormap
geo_pop.plot(column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
         cmap='cividis',  # Use cividis colormap
         edgecolor='w', linewidth=0.1, figsize=(8, 8))
         
# Add title with proper centering
plt.title('Choropleths of Immunized Children (0-3) in India, per Subdistricts', pad=20, loc='center')

# Add colorbar with proper positioning
cax = plt.axes([0.95, 0.1, 0.03, 0.8])  # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='cividis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = []  # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)

# Set colorbar label
cbar.set_label('Percentage of children aged 0-3 years immunized')

# Show the plot
plt.show()

Graph Excellence

  1. Decent, colour-blind-friendly palette: cividis, which remains readable for protanopia
  2. Clear purpose and clearly labelled axes and title
  3. No distortion of the data: minimal modification, so the map is shown naturally
  4. Wise use of space

We can see that some subdistricts (and districts) have high immunization levels and some have low ones. We will do a case study on the three districts below and their neighbours (neighbours retrieved via the maps at https://www.mapsofindia.com/districts-india/, and also checked using Queen contiguity during the spatial weight calculation). I am also considering inclusivity:

  1. Pathanamthitta district, representing the highest immunization level (100%, the right tail of the distribution). Its neighbours: 'Alappuzha', 'Kottayam', 'Kollam', 'Idukki', 'Ernakulam'.
  2. Bhandara district, representing a roughly average immunization level (75%). Its neighbours: 'Nagpur', 'Gondiya', 'Seoni', 'Balaghat', 'Chhindwara', 'Garhchirolo', 'Chandrapur', 'Wardha'.
  3. Anjaw district, representing a low immunization level (30%, the left tail of the distribution). Its neighbours: 'Lohit', 'Changlang', 'Namsai'.

Where do these three districts come from? I analyzed them by zooming into the top and bottom districts by % of children immunized.
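As a side note, the Queen-contiguity idea used to double-check the neighbour lists can be sketched on a toy grid (pure Python; the `queen_neighbours`/`rook_neighbours` helpers are hypothetical, and Part 05 derives the real neighbour lists from the district polygons): two areas are Queen neighbours if they share an edge or a corner.

```python
# Toy illustration of Queen vs Rook contiguity on a 3x3 grid of cells.
# Cells are identified by (row, col) tuples.

def queen_neighbours(cell, nrows=3, ncols=3):
    """Cells sharing an edge OR a corner with `cell`."""
    r, c = cell
    return {(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
            and 0 <= r + dr < nrows and 0 <= c + dc < ncols}

def rook_neighbours(cell, nrows=3, ncols=3):
    """Cells sharing an edge with `cell` (corners excluded)."""
    r, c = cell
    return {(r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < nrows and 0 <= c + dc < ncols}

# The centre cell has 8 Queen neighbours but only 4 Rook neighbours,
# while a corner cell has 3 and 2 respectively.
print(len(queen_neighbours((1, 1))), len(rook_neighbours((1, 1))))  # 8 4
print(len(queen_neighbours((0, 0))), len(rook_neighbours((0, 0))))  # 3 2
```

With real polygons, libraries such as libpysal compute the same relationship from shared boundaries instead of grid offsets.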

In [38]:
# Here, I will show you Top 10
n_top_districts = 10

# Set the seaborn color palette to a colorblind-friendly palette
sns.set_palette("colorblind")

# Sort the DataFrame by 'percent_immunized_child' in descending order
sorted_df = geo_pop_district.sort_values(by='percent_immunized_child', ascending=False)

# Select the top districts based on the variable n_top_districts
top_districts = sorted_df.head(n_top_districts)

# Create the bar chart for 'percent_immunized_child'
fig, ax1 = plt.subplots(figsize=(12, 6))

color = sns.color_palette()[0]  # Use the first color from the palette
ax1.set_xlabel('District')
ax1.set_ylabel('Percent Immunized Child', color=color)
ax1.bar(top_districts['district_name'], top_districts['percent_immunized_child'], color=color, label='Percent Immunized Child')
ax1.tick_params(axis='y', labelcolor=color)
ax1.set_ylim(0, 100)  # Set the y-axis limit for 'Percent Immunized Child'

# Create the line chart for 'percent_icds_women'
ax2 = ax1.twinx()
color = sns.color_palette()[1]  # Use the second color from the palette
ax2.set_ylabel('Percent ICDS Women', color=color)
ax2.plot(top_districts['district_name'], top_districts['percent_icds_women'], color=color, marker='o', linestyle='--', label='Percent ICDS Women')
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim(0, 100)  # Set the y-axis limit for 'Percent ICDS Women'

plt.title(f'{n_top_districts} Districts with Highest Percentage of Immunized Child and ICDS Women')
fig.tight_layout()
fig.legend(loc='upper left', bbox_to_anchor=(0.8, 0.2))  # Set the legend to a custom position
plt.show()

Graph Excellence

  1. Clear purpose and clearly labelled axes and title
  2. No distortion of the data: minimal modification, so the bars are shown naturally
  3. Wise use of space (maximized)
  4. Clear legends
  5. Colour-blind friendly (contrasting but not polar opposites)
  6. Fair scales on both y-axes (less bias at a glance)

As we can see above, there is an outlier (in Janjgir, the percentage of immunized children is high but the number of women receiving ICDS is very low!). After checking in detail, it turns out this is an outlier because the population of Janjgir is very low. Thus, we will filter out districts with a total population under 10,000.

In [39]:
# Sort the DataFrame by 'total_population' in ascending order
geo_pop_district.sort_values(by='total_population').head(5)

# To keep the column order consistent, define the desired order
desired_column_order = ['pc11_state_id', 'pc11_district_id', 'district_name', 'geometry',
                         'female_population', 'male_population',
                         'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i',
                         'total_grd_and_pg_in_village', 'total_hhd',
                         'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
                         'total_shg', 'percent_immunized_child',
                         'percent_grd', 'percent_shg', 'percent_icds_women', 'total_population']

# Reorder the columns
geo_pop_district = geo_pop_district[desired_column_order]

We can also see here that some districts indeed have a very small total population.

In [40]:
selected_columns = ['pc11_state_id', 'pc11_district_id', 'district_name', 'geometry',
                     'total_population', 'percent_immunized_child']
geo_pop_district[selected_columns].sort_values(by='total_population').head()
Out[40]:
pc11_state_id pc11_district_id district_name geometry total_population percent_immunized_child
423 12 257.0 Dibang Valley POLYGON ((96.11495 29.39999, 96.14150 29.36881... 292.000000 62.5
61 22 405.0 Janjgir - Champa POLYGON ((82.64385 22.21103, 82.64583 22.20764... 2038.000000 100.0
62 22 406.0 Bilaspur POLYGON ((82.08579 23.11360, 82.08697 23.11342... 2363.000000 10.3
637 35 638.0 Nicobars MULTIPOLYGON (((93.71364 7.21016, 93.71297 7.2... 6462.021944 83.2
451 15 285.0 Serchhip POLYGON ((92.89217 23.56077, 92.89312 23.55949... 15921.602747 96.7
In [41]:
# Filter rows where 'total_population' is greater than or equal to 10000
geo_pop_district = geo_pop_district[geo_pop_district['total_population'] >= 10000]

To make sure that the deleted data is insignificant, we can see how spread out the data is using a box plot. As shown below, the bulk of the distribution lies well within the box plot's span, hence districts below 10,000 can be deleted.

In [42]:
# Set up the figure and axes
plt.figure(figsize=(8, 4))

# Create a box plot
sns.boxplot(x='total_population', data=geo_pop_district, color='skyblue')

# Add title and labels
plt.title('Total Population by District')
plt.xlabel('Total Population')

# Show the plot
plt.show()

Next, check again which districts have the highest and lowest immunization percentages.

In [43]:
n_top_districts = 10

# Set the seaborn color palette to a colorblind-friendly palette
sns.set_palette("colorblind")

# Sort the DataFrame by 'percent_immunized_child' in descending order
sorted_df = geo_pop_district.sort_values(by='percent_immunized_child', ascending=False)

# Select the top districts based on the variable n_top_districts
top_districts = sorted_df.head(n_top_districts)

# Create the bar chart for 'percent_immunized_child'
fig, ax1 = plt.subplots(figsize=(12, 6))

color = sns.color_palette()[0]  # Use the first color from the palette
ax1.set_xlabel('District')
ax1.set_ylabel('Percent Immunized Child', color=color)
ax1.bar(top_districts['district_name'], top_districts['percent_immunized_child'], color=color, label='Percent Immunized Child')
ax1.tick_params(axis='y', labelcolor=color)
ax1.set_ylim(50, 100)  # Set the y-axis limit for 'Percent Immunized Child'

# Create the line chart for 'percent_icds_women'
ax2 = ax1.twinx()
color = sns.color_palette()[1]  # Use the second color from the palette
ax2.set_ylabel('Percent ICDS Women', color=color)
ax2.plot(top_districts['district_name'], top_districts['percent_icds_women'], color=color, marker='o', linestyle='--', label='Percent ICDS Women')
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim(50, 100)  # Set the y-axis limit for 'Percent ICDS Women'

plt.title(f'{n_top_districts} Districts with Highest Percentage of Immunized Child and ICDS Women')
fig.tight_layout()
fig.legend(loc='upper left', bbox_to_anchor=(0.8, 0.2))  # Set the legend to a custom position
plt.show()

Graph Excellence

  1. Clear purpose and clearly labelled axes and title
  2. No distortion of the data: minimal modification, so the bars are shown naturally
  3. Wise use of space (maximized)
  4. Clear legends
  5. Colour-blind friendly (contrasting but not polar opposites)
  6. Fair scales on both y-axes (less bias at a glance)
  7. To maximize the view of the data, the y-limits of both axes are 50-100% (not one-sided)

Also, do the same for the lowest levels.

In [44]:
n_bottom_districts = 10

# Set the seaborn color palette to a colorblind-friendly palette
sns.set_palette("colorblind")

# Sort the DataFrame by 'percent_immunized_child' in ascending order to get the bottom districts
sorted_df_bottom = geo_pop_district.sort_values(by='percent_immunized_child', ascending=True)

# Select the bottom districts based on the variable n_bottom_districts
bottom_districts = sorted_df_bottom.head(n_bottom_districts)

# Create the bar chart for 'percent_immunized_child'
fig, ax1 = plt.subplots(figsize=(12, 6))

color = sns.color_palette()[0]  # Use the first color from the palette
ax1.set_xlabel('District')
ax1.set_ylabel('Percent Immunized Child', color=color)
ax1.bar(bottom_districts['district_name'], bottom_districts['percent_immunized_child'], color=color, label='Percent Immunized Child')
ax1.tick_params(axis='y', labelcolor=color)
ax1.set_ylim(0, 100)  # Set the y-axis limit for 'Percent Immunized Child'

# Create the line chart for 'percent_icds_women'
ax2 = ax1.twinx()
color = sns.color_palette()[1]  # Use the second color from the palette
ax2.set_ylabel('Percent ICDS Women', color=color)
ax2.plot(bottom_districts['district_name'], bottom_districts['percent_icds_women'], color=color, marker='o', linestyle='--', label='Percent ICDS Women')
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim(0, 100)  # Set the y-axis limit for 'Percent ICDS Women'

plt.title(f'{n_bottom_districts} Districts with Lowest Percentage of Immunized Child and ICDS Women')
fig.tight_layout()
fig.legend(loc='upper left', bbox_to_anchor=(0.8, 0.2))  # Set the legend to a custom position
plt.show()

Graph Excellence

  1. Clear purpose and clearly labelled axes and title
  2. No distortion of the data: minimal modification, so the bars are shown naturally
  3. Wise use of space (maximized)
  4. Clear legends
  5. Colour-blind friendly (contrasting but not polar opposites)
  6. Fair scales on both y-axes (less bias at a glance)

After deciding on the three inclusive districts, we will dive deeper and focus on testing H4 in each of them.

First district analysis + amenities related to healthcare.

Amenities can be retrieved from OSMnx; after reading the documentation, the relevant tags are "building": "hospital" and "amenity": ["hospital", "clinic", "doctors", "social_facility"].

Pathanamthitta District

In [45]:
# Change 'percentage_immunized' to 'percent_immunized_child' in geo_pop dataframe
geo_pop = geo_pop.rename(columns={'percentage_immunized': 'percent_immunized_child'})

# Select a district with high percentage of immunization also with its neighbour
# To know the neighbour, I use maps here https://www.mapsofindia.com/districts-india/

selected_districts = ['Pathanamthitta', 'Alappuzha', 'Kottayam', 'Kollam', 'Idukki', 'Ernakulam', 'Thrissur', 'Theni', 'Tenkasi', 'Tirunelveli']
area1 = selected_districts  # I will put the above districts to different array, so in the end I can combine all selected districts!
filtered_geo_pop = geo_pop[geo_pop['district_name'].isin(selected_districts)]

# Input data from OSMX
# I use bbox that I input manually after knowing the boundary of the districts and neighbours
north, south, east, west = 10.4, 8.8, 77.4, 76.2
health_facil = ox.features_from_bbox(north, south, east, west, tags={"building": "hospital", "amenity": ["hospital", "clinic","doctors","social_facility"]})
#water = ox.features_from_bbox(north, south, east, west, tags={'natural': 'water'})

# Convert the features into GeoDataFrame
health_facil_gdf = gpd.GeoDataFrame.from_features(health_facil, crs='EPSG:4326')  # keep WGS84 so later spatial operations know the CRS

num_division = 20  # Divide into how many sections
classi = mapclassify.Quantiles(filtered_geo_pop['percent_immunized_child'], k=num_division)

fig, ax = plt.subplots(figsize=(8, 8))

filtered_geo_pop.plot(ax=ax, column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
                      cmap='viridis',  # Use viridis colormap
                      edgecolor='w', linewidth=0.1, figsize=(8, 8))

# Add colorbar
cax = plt.axes([0.95, 0.1, 0.03, 0.8])  # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = []  # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)

# Set colorbar label
cbar.set_label('Percentage of Children (0-3 years) with Immunization')

health_facil_gdf.plot(ax=ax, color='red', markersize=10, label='Hospitals and Health Facility')  # Use dark red color
#water_gdf.plot(ax=ax, color='blue')


# Add legend
ax.legend()
ax.set_ylim(south-0.2, north+0.2)
ax.set_xlim(west-0.2, east+0.2)

ctx.add_basemap(ax, crs=geo_pop.crs)

ax.set_title('Immunization Percentage in Pathanamthitta Districts and surroundings')
# Show the plot
plt.show()
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/mapclassify/classifiers.py:1592: UserWarning: Not enough unique values in array to form 20 classes. Setting k to 17.
  self.bins = quantile(y, k=k)
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/mapclassify/classifiers.py:1592: UserWarning: Not enough unique values in array to form 20 classes. Setting k to 17.
  self.bins = quantile(y, k=k)
/var/folders/h2/ws329tlx2tj_9vdbqdyj50v40000gn/T/ipykernel_13231/2884155885.py:43: UserWarning: Legend does not support handles for PatchCollection instances.
See: https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html#implementing-a-custom-legend-handler
  ax.legend()

Graph Excellence

  1. Decent, colour-blind-friendly palette: viridis, which remains readable for protanopia
  2. Clear purpose and clearly labelled axes and title
  3. No distortion of the data: minimal modification, so the map is shown naturally
  4. Wise use of space
  5. Easy to read: the areas with high and low immunization percentages stand out clearly

Evaluation

We can see in this district that areas with both high and low levels of children's immunization have hospitals and health facilities spread throughout (not concentrated in one area).

In [46]:
selected_districts = ['Anjaw', 'Lohit', 'Changlang', 'Namsai']
area2 = selected_districts  # I will put the above districts into a different array so that I can combine all selected districts!
filtered_geo_pop = geo_pop[geo_pop['district_name'].isin(selected_districts)]

north, south, east, west = 28.4, 27, 97.5, 95.5
health_facil = ox.features_from_bbox(north, south, east, west, tags={"building": "hospital", "amenity": ["hospital", "clinic", "doctors", "social_facility"]})

# Convert the features into GeoDataFrame
health_facil_gdf = gpd.GeoDataFrame.from_features(health_facil, crs='EPSG:4326')  # keep WGS84 so later spatial operations know the CRS

num_division = 20  # Divide into how many sections
classi = mapclassify.Quantiles(filtered_geo_pop['percent_immunized_child'], k=num_division)

fig, ax = plt.subplots(figsize=(10, 10))

filtered_geo_pop.plot(ax=ax, column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
                      cmap='viridis',  # Use viridis colormap
                      edgecolor='w', linewidth=0.1, figsize=(8, 8))

# Add colorbar
cax = plt.axes([0.95, 0.1, 0.03, 0.8])  # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = []  # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)

# Set colorbar label
cbar.set_label('Percentage of Children (0-3 years) with Immunization')

health_facil_gdf.plot(ax=ax, color='red', markersize=10, label='Hospitals and Health Facility')  # Use dark red color

# Add legend
ax.legend()

ctx.add_basemap(ax, crs=geo_pop.crs)

ax.set_title('Immunization Percentage in Anjaw Districts and surroundings')
# Show the plot
plt.show()
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/mapclassify/classifiers.py:1592: UserWarning: Not enough unique values in array to form 20 classes. Setting k to 17.
  self.bins = quantile(y, k=k)
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/mapclassify/classifiers.py:1592: UserWarning: Not enough unique values in array to form 20 classes. Setting k to 17.
  self.bins = quantile(y, k=k)

Graph Excellence

  1. Decent, colour-blind-friendly palette: viridis, which remains readable for protanopia
  2. Clear purpose and clearly labelled axes and title
  3. No distortion of the data: minimal modification, so the map is shown naturally
  4. Wise use of space
  5. Easy to read: the areas with high and low immunization percentages stand out clearly

Evaluation

We can also see that this district has very few health-related facilities. This is expected, since we chose it as one of the districts with the lowest percentage of immunized children aged 0-3 years. But, as you can clearly see, subdistricts here with both high and low immunization levels have very few health facilities.

In [47]:
selected_districts = ['Bhandara', 'Nagpur', 'Gondiya', 'Seoni', 'Balaghat', 'Chhindwara', 'Garhchirolo', 'Chandrapur', 'Wardha']
area3 = selected_districts  # I will put the above districts into a different array so that I can combine all selected districts!
filtered_geo_pop = geo_pop[geo_pop['district_name'].isin(selected_districts)]

north, south, east, west = 22.7, 19.5, 81, 78.5
health_facil = ox.features_from_bbox(north, south, east, west, tags={"building": "hospital", "amenity": ["hospital", "clinic", "doctors", "social_facility"]})

# Convert the features into GeoDataFrame
health_facil_gdf = gpd.GeoDataFrame.from_features(health_facil, crs='EPSG:4326')  # keep WGS84 so later spatial operations know the CRS

num_division = 20  # Divide into how many sections
classi = mapclassify.Quantiles(filtered_geo_pop['percent_immunized_child'], k=num_division)

fig, ax = plt.subplots(figsize=(10, 10))

filtered_geo_pop.plot(ax=ax, column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
                      cmap='viridis',  # Use viridis colormap
                      edgecolor='w', linewidth=0.1, figsize=(8, 8))

# Add colorbar
cax = plt.axes([0.95, 0.1, 0.03, 0.8])  # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = []  # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)

# Set colorbar label
cbar.set_label('Percentage of Children (0-3 years) with Immunization')

health_facil_gdf.plot(ax=ax, color='red', markersize=10, label='Hospitals and Health Facility')  # Use dark red color

# Add legend
ax.legend()
ctx.add_basemap(ax, crs=geo_pop.crs)

# Show the plot
ax.set_title('Immunization Percentage in Bhandara Districts and surroundings')
plt.show()
/var/folders/h2/ws329tlx2tj_9vdbqdyj50v40000gn/T/ipykernel_13231/203090821.py:32: UserWarning: Legend does not support handles for PatchCollection instances.
See: https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html#implementing-a-custom-legend-handler
  ax.legend()

Graph Excellence

  1. Decent, colour-blind-friendly palette: viridis, which remains readable for protanopia
  2. Clear purpose and clearly labelled axes and title
  3. No distortion of the data: minimal modification, so the map is shown naturally
  4. Wise use of space
  5. Easy to read: the areas with high and low immunization percentages stand out clearly

Evaluation

After seeing these three representative districts, can we conclude that H4 is rejected? The health facilities are clearly not concentrated only in districts with a low % of immunization. But let us not rely on our eyes alone; let us check the data using Pearson correlation again.

From the graphs above, we need to add a new column to the geo_pop dataframe.

Use the steps below:

  1. Spatially join the health facilities to find out which district each belongs to
  2. Calculate the area of each district
  3. Derive a more inclusive and representative variable: the number of health-related facilities per unit of district area!
In [48]:
# Join each hospital/health facility with the corresponding subdistrict
# Here, I use "inner" (keep only matched rows) and the "within" predicate
# (keep facilities that fall inside a subdistrict polygon)
# The data to be joined: the health facility gdf and the main dataframe (filtered by district)
selected_districts = area1 + area2 + area3 
filtered_geo_pop = geo_pop[geo_pop['district_name'].isin(selected_districts)]

# Make sure the facility layer carries a CRS before the spatial join
# (note: at this point health_facil_gdf only holds facilities from the last
# bbox queried, the Bhandara area; concatenating all three bboxes would give
# a complete picture)
health_facil_gdf = health_facil_gdf.set_crs('EPSG:4326', allow_override=True)
joined_data = gpd.sjoin(health_facil_gdf, filtered_geo_pop, how="inner", predicate="within")

# Calculate the number of health facilities per subdistrict
hospital_counts = joined_data.groupby('subdistrict_name').size().reset_index(name='health_facil_count')

# Calculate the area of each subdistrict in sq. km
# (EPSG:32633 is UTM zone 33N; a zone covering India, e.g. 43N/44N, would be more accurate)
calc_area = filtered_geo_pop[['subdistrict_name','district_name','geometry','percent_immunized_child']].copy()
calc_area = calc_area.to_crs('EPSG:32633')
calc_area['total_area'] = calc_area.geometry.area / 10**6

# Merge hospital_counts and the subdistrict areas to create the new dataframe
health_facility_density_df = pd.merge(hospital_counts, calc_area[['district_name','subdistrict_name','percent_immunized_child','total_area','geometry']], on='subdistrict_name', how='outer')
health_facility_density_df = health_facility_density_df.fillna(0)
health_facility_density_df['health_facility/sq.km'] = health_facility_density_df['health_facil_count'] / health_facility_density_df['total_area']

column_order = ['district_name', 'subdistrict_name','health_facil_count', 'total_area','health_facility/sq.km','percent_immunized_child','geometry']
health_facility_density_df = health_facility_density_df[column_order]
health_facility_density_df.head()
Out[48]:
district_name subdistrict_name health_facil_count total_area health_facility/sq.km percent_immunized_child geometry
0 Chhindwara Amarwara 4.0 2994.261732 0.001336 70.6 POLYGON ((8136551.045 4848193.458, 8136662.527...
1 Gondiya Amgaon 1.0 1105.078742 0.000905 71.0 POLYGON ((8404800.713 4832755.934, 8404840.361...
2 Gondiya Arjuni Morgaon 3.0 3585.795810 0.000837 82.4 POLYGON ((8447580.202 4719914.218, 8448069.824...
3 Balaghat Baihar 7.0 9699.919428 0.000722 74.2 POLYGON ((8362487.172 5026853.348, 8362637.339...
4 Balaghat Balaghat 17.0 4077.470261 0.004169 66.0 POLYGON ((8270767.000 4931971.672, 8271153.673...
In [49]:
from scipy.stats import pearsonr

# Calculate correlation coefficient and p-value
correlation_coefficient, p_value = pearsonr(health_facility_density_df['health_facility/sq.km'], health_facility_density_df['percent_immunized_child'])

# Print results
print(f'Correlation coefficient: {correlation_coefficient}')
print(f'P-value: {p_value}')
Correlation coefficient: -0.045525608575021206
P-value: 0.5737801079884187
In [50]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='health_facility/sq.km', y='percent_immunized_child', data=health_facility_density_df)
plt.title('Health Facility Density vs. Percentage of Immunized Children')
plt.xlabel('Health Facility Density (per sq. km)')
plt.ylabel('Percentage Children (0-3) Immunized')
plt.xlim(0, 0.001)  # Set x-axis limits
plt.grid(True)
plt.show()

Principle of excellence:

  1. Clear purpose and clear labels on the axes and title (I use the true variable names, not the dataframe column names).
  2. Select the best graph type (a scatter plot shows the spread of the individual data points).
  3. Clear colours, easy to read.

Reflection

As we can see above, the p-value > 5%, so the correlation is not statistically significant. Thus there may be no relation between the percentage of immunized children and health-related facilities per sq. km (H4). H4 is rejected!

Critical Data Science Reflection

  1. Inclusivity: the health-related facility data is also normalized per sq. km.
  2. However, there is a possible bias: a low facility count may simply result from geographical terrain (mountains, lakes, etc.). This will be explored further in future research, but it should be noted here!
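Since Pearson's r only captures linear association, a rank-based check is a cheap robustness step before rejecting H4 outright. Below is a minimal NumPy sketch of Spearman's rho (the helper name is hypothetical, and ties are not handled); on the real data you would pass the two columns used above instead of the toy arrays:

```python
import numpy as np

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the ranks (ties not handled here)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# A monotone but non-linear relationship still scores a perfect 1.0
x = np.array([0.001, 0.002, 0.004, 0.008])
y = np.array([60.0, 61.0, 70.0, 95.0])
print(spearman_rho(x, y))  # 1.0
```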

Part 05: Exploratory Spatial Data Analysis ¶

In this part, main activities are:

  1. Spatial Weight
  2. ESDA Global
  3. ESDA Local

This part is mainly to test final Hypothesis, the H5.

H5: the level of immunized children (age 0-3 years) in an area is affected by the immunization level of neighbouring areas. The immunization practices of one area (subdistrict) may influence the immunization behaviour of nearby areas. This follows from the idea that communities with higher immunization rates create a positive norm that influences neighbouring communities.

Main idea: to check how an area is correlated with its neighbours in terms of the percentage of immunized children (0-3 years), we will perform both Global and Local spatial autocorrelation analysis.

1. Creating Spatial Weight

Before doing ESDA at both the Global and Local level, we need to create a spatial weights matrix. Here, we will use "queen" contiguity, which defines neighbours as units sharing a common boundary or vertex. Queen contiguity is used because it is less sensitive to irregular polygon shapes in the spatial distribution of the units.
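To make the queen definition concrete, here is a tiny pure-Python sketch (function name hypothetical) that builds queen neighbours on a regular rows × cols grid; libpysal's Queen does the same thing for arbitrary polygons:

```python
def queen_neighbors(rows, cols):
    """Queen contiguity on a grid: neighbours share an edge OR a corner."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            cell = r * cols + c
            nbrs[cell] = [
                rr * cols + cc
                for rr in range(r - 1, r + 2)
                for cc in range(c - 1, c + 2)
                if (rr, cc) != (r, c) and 0 <= rr < rows and 0 <= cc < cols
            ]
    return nbrs

# On a 3x3 grid, the centre cell (id 4) touches all 8 other cells under queen
# contiguity, but only 4 under rook contiguity (shared edges only)
print(len(queen_neighbors(3, 3)[4]))  # 8
```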

In [51]:
variables = 'pc11_district_id'
weight_df = geo_pop_district

# Check for duplicate entries
if weight_df[variables].duplicated().any():
    # If duplicates exist, drop them or handle them as per your data requirements
    weight_df = weight_df.drop_duplicates(subset=variables)

# Set the index
weight_df = weight_df.set_index(variables, drop=False)

# Create spatial weights matrix
w_queen = Queen.from_dataframe(weight_df, ids=variables)
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: 
 There are 25 disconnected components.
 There are 5 islands with ids: 71.0, 88.0, 528.0, 639.0, 640.0.
  warnings.warn(message)

As you can see above, the number of islands is quite low, so we can ignore them in further exploration. There are 604 districts in total, which makes the 5 islands less than 1% of the data (ignoring them will not affect the results much).

In [52]:
w_queen.islands
Out[52]:
[71.0, 88.0, 528.0, 639.0, 640.0]
In [53]:
# The index was already set to 'pc11_district_id' above, so we can drop the islands directly
weight_df = weight_df.drop(w_queen.islands)

Once we have removed the islands from the set of districts, we need to re-calculate the weights matrix:

In [54]:
w_queen = Queen.from_dataframe(weight_df, ids=variables)
w_queen.transform = 'R'
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: 
 There are 20 disconnected components.
  warnings.warn(message)

Also check the plot to see how the numbers of neighbours are distributed. As the graph below shows, the average number of neighbours per district is around 4.

In [55]:
queen_card = pd.Series(w_queen.cardinalities)
sns.histplot(queen_card, bins=10)
plt.title('Distribution of Cardinalities for Queen Contiguity')
Out[55]:
Text(0.5, 1.0, 'Distribution of Cardinalities for Queen Contiguity')

Principle of excellence:

  1. Clear purpose and clear labels on the axes and title (I use the true variable names, not the dataframe column names).
  2. Select the best graph type (a histogram shows the distribution and central tendency).
  3. Clear colours, easy to read.

2. Spatial Lag

Next, we will determine the Spatial Lag.

We will create some new variables:

  1. the spatial lag of the dependent variable percent_immunized_child (the weighted average of its value across each unit's neighbours);
  2. standardized (z-score) versions of the variable and its lag.
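Under a row-standardized W, the spatial lag is just the average of each unit's neighbours' values, i.e. a matrix-vector product. A toy NumPy sketch (the 3×3 matrix and values are made up for illustration):

```python
import numpy as np

# Row-standardized weights for 3 toy areas (each row sums to 1)
W = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
y = np.array([70.0, 80.0, 60.0])  # e.g. percent_immunized_child

lag = W @ y  # spatial lag: weighted average of each area's neighbours
print(lag)  # [70. 70. 75.]
```

This is exactly what `weights.lag_spatial(w_queen, ...)` computes for the real weights matrix below.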
In [56]:
weight_df['w_percent_immunized_child'] = weights.lag_spatial(w_queen, weight_df['percent_immunized_child'])
In [57]:
# Check whether it works or not :D
weight_df[['pc11_district_id', 'district_name','percent_immunized_child', 'w_percent_immunized_child']].head(5)
Out[57]:
pc11_district_id district_name percent_immunized_child w_percent_immunized_child
pc11_district_id
468.0 468.0 Kachchh 68.4 69.75
469.0 469.0 Banas Kantha 68.5 65.15
470.0 470.0 Patan 61.9 68.45
471.0 471.0 Mahesana 82.7 72.00
472.0 472.0 Sabar Kantha 76.1 72.60

For example, let us check the neighbours of one district:

In [58]:
w_queen.neighbors[468.0]
Out[58]:
[477.0, 476.0, 469.0, 470.0]

Next, create the standardized variables (z-scores):

In [59]:
weight_df['percent_immunized_child_std'] = (weight_df['percent_immunized_child'] - weight_df['percent_immunized_child'].mean()) / weight_df['percent_immunized_child'].std()
In [60]:
weight_df['w_percent_immunized_child_std'] = weights.lag_spatial(w_queen, weight_df['percent_immunized_child_std'])

Moran Plot

Finally, we will see how the w_percent_immunized_child_std correlates with percent_immunized_child_std

In [61]:
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))
# Plot values
sns.regplot(x='percent_immunized_child_std', y='w_percent_immunized_child_std', data=weight_df, ci=None)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
# Display
plt.title("Moran Plot of Standardized Children's Immunization Percentages")
plt.show()

Principle of excellence:

  1. Clear purpose and clear labels on the axes and title.
  2. Select the best graph type (a scatter plot shows the spread of the individual data points).
  3. Clear colours, easy to read.

In the context of the percentage of immunized children above, the Moran Plot can be interpreted as follows: districts display positive spatial autocorrelation in how they provide immunization to children. This means that districts with a high percentage of immunization tend to be located near other districts with high immunization percentages, and vice versa.

Moran's I

In order to calculate Moran's I in our dataset, we can call a specific function in PySAL directly:

In [62]:
mi = esda.Moran(weight_df['percent_immunized_child'], w_queen)
print("Moran I value = ")
mi.I
Moran I value = 
Out[62]:
0.586662400879172
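For intuition: Moran's I is I = (n / S0) · (zᵀWz) / (zᵀz), where z holds the deviations from the mean and S0 is the sum of all weights; with a row-standardized W it equals the slope of the Moran plot. A self-contained NumPy sketch on toy data (function name hypothetical):

```python
import numpy as np

def morans_i(y, W):
    # I = (n / S0) * (z' W z) / (z' z), with z = deviations from the mean
    z = y - y.mean()
    return float(len(y) / W.sum() * (z @ W @ z) / (z @ z))

# 4 areas on a line (0-1-2-3), row-standardized weights
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])  # low values cluster next to low, high next to high
print(morans_i(y, W))  # 0.5 -> positive spatial autocorrelation
```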
In [63]:
moran_scatterplot(mi);

Principle of excellence:

  1. Clear purpose and clear labels on the axes and title.
  2. Select the best graph type (a scatter plot shows the spread of the individual data points).
  3. Clear colours, easy to read.

Local Spatial autocorrelation

Moran's I provides an indication of overall clustering but doesn't pinpoint the specific locations of clusters. To identify the spatial distribution at a finer level, local measures of spatial autocorrelation are crucial. Unlike global measures that analyze the entire dataset, local measures operate on individual observations, offering detailed insights instead of summarizing the entire map.

In [64]:
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))
# Plot values
sns.regplot(x='percent_immunized_child_std', y='w_percent_immunized_child_std', data=weight_df, ci=None)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
plt.text(1.75, 0.5, "HH", fontsize=25)
plt.text(1.5, -1.5, "HL", fontsize=25)
plt.text(-2, 1, "LH", fontsize=25)
plt.text(-1.5, -2.5, "LL", fontsize=25)
# Display
plt.show()

Next, we identify cases in which the comparison between the value of an observation and the average of its neighbours is either more similar (HH, LL) or more dissimilar (HL, LH) than we would expect from pure chance. For this, we need LISA (Local Indicators of Spatial Association).

In [65]:
lisa = esda.Moran_Local(weight_df['percent_immunized_child'], w_queen)

All we need to pass is the variable of interest (the percentage of immunized children) and the spatial weights that describe the neighbourhood relations between the observations that make up the dataset.

Next, to make the map-making more straightforward, it is convenient to pull these results out and insert them into the main data table (weight_df):

In [66]:
# Break observations into significant or not
weight_df['significant'] = lisa.p_sim < 0.05
# Store the quadrant they belong to
weight_df['quadrant'] = lisa.q

Next, inspect the significance column of the dataframe, followed by the quadrant column.

In [67]:
weight_df['significant'].head(20)
Out[67]:
pc11_district_id
468.0    False
469.0    False
470.0    False
471.0    False
472.0    False
473.0    False
474.0    False
475.0    False
476.0    False
477.0    False
478.0    False
479.0    False
480.0    False
481.0    False
482.0    False
483.0    False
484.0    False
485.0    False
486.0    False
487.0     True
Name: significant, dtype: bool
In [68]:
weight_df['quadrant'].head()
Out[68]:
pc11_district_id
468.0    3
469.0    3
470.0    3
471.0    1
472.0    1
Name: quadrant, dtype: int64

The correspondence between the numbers in the variable and the actual quadrants (following the PySAL convention) is as follows:

  • 1: HH
  • 2: LH
  • 3: LL
  • 4: HL

HH (High-High): Explanation: Districts in the High-High quadrant have high immunization percentages and are surrounded by other districts with high immunization percentages. Conclusion: This indicates spatial clustering of high immunization rates, suggesting a positive spatial autocorrelation. High-performing districts are geographically close to other high-performing districts.

LL (Low-Low): Explanation: Districts in the Low-Low quadrant have low immunization percentages and are surrounded by other districts with low immunization percentages. Conclusion: This suggests spatial clustering of low immunization rates, which is also positive spatial autocorrelation (similar values clustering together). Districts with low immunization rates tend to be close to other districts with similarly low rates.

LH (Low-High): Explanation: Districts in the Low-High quadrant have low immunization percentages but are surrounded by districts with high immunization percentages. Conclusion: This pattern indicates spatial outliers where districts with low immunization rates are in close proximity to districts with high rates. It may be an area of concern or a unique spatial pattern.

HL (High-Low): Explanation: Districts in the High-Low quadrant have high immunization percentages but are surrounded by districts with low immunization percentages. Conclusion: Similar to LH, this pattern represents spatial outliers where districts with high immunization rates are in close proximity to districts with low rates. This, too, could be an area of concern or a unique spatial pattern.
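The mapping from the signs of (z, lag of z) to quadrant codes can be sketched as a tiny helper (name hypothetical), following the PySAL convention 1=HH, 2=LH, 3=LL, 4=HL:

```python
def moran_quadrant(z, lag_z):
    """PySAL convention: 1 = HH, 2 = LH, 3 = LL, 4 = HL."""
    if z >= 0 and lag_z >= 0:
        return 1  # high value, high-valued neighbours
    if z < 0 <= lag_z:
        return 2  # low value, high-valued neighbours
    if z < 0 and lag_z < 0:
        return 3  # low value, low-valued neighbours
    return 4      # high value, low-valued neighbours

print([moran_quadrant(*p) for p in [(1, 1), (-1, 1), (-1, -1), (1, -1)]])  # [1, 2, 3, 4]
```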

Next, with these two columns, significant and quadrant, we can build a typical LISA cluster map:

In [69]:
lisa_cluster(lisa, weight_df);

Graph Excellence

  1. Use decent colour and contrast (easy to read)
  2. Wise use of space
  3. Clear legend
In [70]:
# A more detailed, manually styled version of the cluster map

# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))

# Plot insignificant clusters
ns = weight_df.loc[weight_df['significant']==False, 'geometry']
ns.plot(ax=ax, color='k')

# Plot HH clusters
hh = weight_df.loc[(weight_df['quadrant']==1) & (weight_df['significant']==True), 'geometry']
hh.plot(ax=ax, color=sns.color_palette("coolwarm", as_cmap=True)(0.8))  # Red

# Plot LL clusters
ll = weight_df.loc[(weight_df['quadrant']==3) & (weight_df['significant']==True), 'geometry']
ll.plot(ax=ax, color=sns.color_palette("coolwarm", as_cmap=True)(0.2))  # Blue

# Plot LH clusters
lh = weight_df.loc[(weight_df['quadrant']==2) & (weight_df['significant']==True), 'geometry']
lh.plot(ax=ax, color=sns.color_palette("coolwarm", as_cmap=True)(0.35))  # Light blue

# Plot HL clusters
hl = weight_df.loc[(weight_df['quadrant']==4) & (weight_df['significant']==True), 'geometry']
hl.plot(ax=ax, color=sns.color_palette("coolwarm", as_cmap=True)(0.65))  # Light red

# Style and draw
f.suptitle('LISA for Immunization Percentage on Children 0-3 yrs in India', size=14)
f.set_facecolor('0.75')
ax.set_axis_off()

plt.show()

Graph Excellence

  1. Use decent colour and contrast (easy to read)
  2. Wise use of space
  3. Clear legend
In [71]:
plot_local_autocorrelation(lisa, weight_df, 'percent_immunized_child');

Graph Excellence

  1. Use decent colour and contrast (easy to read and colour-blind friendly)
  2. Wise use of space
  3. Clear legend

Reflection

Regarding to H5:

  1. Visually, based on the Moran Plot above: the outliers (HL and LH) are relatively few, and their clusters are small.
  2. Checking the significance of Moran's I (see the code below): spatial autocorrelation is statistically significant, meaning that H5 is accepted.
In [72]:
if mi.p_sim < 0.05:
    print("Spatial autocorrelation is statistically significant.")
    if mi.I > 0:
        print("Positive spatial autocorrelation (HH or LL).")
    else:
        print("Negative spatial autocorrelation (LH or HL).")
else:
    print("No statistically significant spatial autocorrelation detected.")
Spatial autocorrelation is statistically significant.
Positive spatial autocorrelation (HH or LL).
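The p_sim used above is a pseudo p-value from a permutation (random relabelling) test: shuffle the values over the map many times, recompute Moran's I each time, and count how often the shuffled I reaches the observed one. A self-contained sketch of the idea (function names hypothetical; esda's own implementation is more refined):

```python
import numpy as np

def morans_i(y, W):
    z = y - y.mean()
    return float(len(y) / W.sum() * (z @ W @ z) / (z @ z))

def moran_p_sim(y, W, n_perm=999, seed=42):
    # Pseudo p-value: share of random relabellings with I >= observed I
    rng = np.random.default_rng(seed)
    i_obs = morans_i(y, W)
    count = sum(morans_i(rng.permutation(y), W) >= i_obs for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)

# Row-standardized weights for 6 areas on a line, values clustered at the two ends
W = np.array([[0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, 9.0, 9.0, 9.0])
print(moran_p_sim(y, W))  # small: such clustering is unlikely under random relabelling
```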

Part 06: Conclusion and Reflection ¶

After this long marathon of hearing my story, I would like to thank Trivik and the TAs for answering my questions in the lab! And of course, thank you for reading this far.

Conclusion

Several hypotheses were tested in this preliminary research, but all of them need to be evaluated with experts such as Trivik to fully remove possible biases. So far:

  1. H1, H2, and H3 are accepted because of significance (via Pearson correlation test). Method: plotting from SHRUG CSV data.
  2. H4 is rejected because it lacks significance (via Pearson correlation test). Method: SHRUG CSV + GPKG data, plus a new variable "health-related facilities per sq. km" derived from OSMnx amenities. It was also visually validated from the geomap.
  3. H5 is accepted because of significance (via the permutation test on Moran's I). Method: SHRUG GPKG + CSV data with spatial autocorrelation analysis.

Final words: some important variables that readers can use to further analyze the percentage of immunized children aged 0-3 years:

  1. Percentage of lactating mothers/women receiving ICDS from government.
  2. Education Level
  3. Community NGOs: Self Help Groups (SHGs) [less significant]
  4. Effect of neighbouring districts

Reflection using Critical Data Science Framework

Important notes for users of this preliminary research, in the context of "Communicate Findings: transparency and accessibility of the results":

  1. There are potential biases here, because we assume the independent variables are well represented by proxies from the SHRUG data:
  • Education is represented by the percentage of graduates and post-graduates in an area (per capita).
  • Economic condition is represented by the percentage of lactating women receiving the ICDS subsidy in an area (per capita).
  • Social condition is represented by the percentage of SHGs per capita.
  • The effect of neighbouring districts is only counted via the variable "children immunization percentage/coverage".
  2. Inclusivity:
  • The data used for testing H1-H3 and H5 covers all of India (less than 5% was cut because of outliers), so it reflects representativeness.
  • The data used for testing H4 is a case study of three chosen districts: the highest, lowest, and average in 'immunized children percentage', hoping to represent all conditions in India. However, since H4 lacks geospatial analysis (for example of mountainous features), there might be bias.
  • Every code cell is explained and every graph follows good practices of excellence, hoping to increase explainability.
  • The data is open source and the code here is accessible, which is expected to increase transparency.
  • Transparency: no data is hidden to "protect" anyone; most of the cases are objective.
  3. Inequality:
  • The data was used as inclusively as possible, but please consider that this does not necessarily mean the data is just.
  • Possible good use of the findings: the government could push ICDS further, because it has a significant correlation with the immunization level of children 0-3 years.
  • Possible bad use of the findings: stigmatizing or unfairly labelling specific communities or demographic groups, which can lead to discrimination and social divisions; or exploiting the findings for marketing purposes (selling vaccines to certain areas), potentially promoting unproven or unsafe immunization-related products or services.
  4. Participation:
  • All graphs follow at least 3 practices of graphical excellence, hoping to be understandable to all stakeholders.
  5. Power:
  • Possible bad political use of the findings: politicizing the results to gain support or advance political agendas, leading to inappropriate policy decisions that may not prioritize public health.
  6. Positionality:
  • I am a researcher (well, in this case, a beginning researcher doing preliminary work) who wants to unveil the data so that the government can push the coverage of immunization of children 0-3 years, because health is a critical factor for human life.
  7. Can the results be enhanced? How? Yes, we can enhance the data, especially all factors of H1, H2, H3, and H5:
  • by choosing directly representative columns instead of proxies;
  • by analyzing geographical features. Some districts with few health-related amenities need their nature analyzed (mostly mountainous, with few people?) instead of being excluded.